All of lore.kernel.org
 help / color / mirror / Atom feed
From: Joel Becker <Joel.Becker@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	ocfs2-devel@oss.oracle.com, Tao Ma <tao.ma@oracle.com>,
	Dave Chinner <dchinner@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com>
Subject: Re: [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF"
Date: Mon, 28 Jun 2010 17:54:04 -0700	[thread overview]
Message-ID: <20100629005403.GC24343@mail.oracle.com> (raw)
In-Reply-To: <20100629002421.GY6590@dastard>

On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
> On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
> > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
> 
> Hi Joel,
> 
> I have no problems with it being reverted - it's a really just a
> WAR for the simplest case of the sync hold holdoff.

	I have to insist that we revert it until we find a way to make
ocfs2 work.  The rest of the email will discuss the ocfs2 issues
therein.

> > This patch causes any filesystem with an allocation unit larger than the
> > filesystem blocksize will leak unzeroed data. During a file extend, the
> > entire allocation unit is zeroed.
> 
> XFS has this same underlying issue - it can have uninitialised,
> allocated blocks past EOF that have to be zeroed when extending the
> file.

	Does XFS do this in get_blocks()?  We deliberately do no
allocation in get_blocks(), which is where our need for up-front
allocation comes from.

> > However, this patch prevents the tail
> > blocks of the allocation unit from being written back to disk.  When the
> > file is next extended, i_size will now cover these unzeroed blocks,
> > leaking the old contents of the disk to userspace and creating a corrupt
> > file.
> 
> XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
> between the old EOF and the new EOF on each extending write. Hence
> these pages get written because they fall inside the new i_size that
> is set during the write.  The i_size on disk doesn't get changed
> until after the data writes have completed, so even on a crash we
> don't expose uninitialised blocks.

	We do the same, but we zero the entire allocation.  This works
both when filling holes and when extending, though obviously the
extending is what we're worried about here.  We change i_size in
write_end, so our guarantee is the same as yours for the page containing
i_size.

> Looking at ocfs2_writepage(), it simply calls
> block_write_full_page(), which does:
> 
> 	/* Is the page fully outside i_size? (truncate in progress) */
> 	offset = i_size & (PAGE_CACHE_SIZE-1);
> 	if (page->index >= end_index+1 || !offset) {
> 		/*
> 		 * The page may have dirty, unmapped buffers.  For example,
> 		 * they may have been added in ext3_writepage().  Make them
> 		 * freeable here, so the page does not leak.
> 		 */
> 		do_invalidatepage(page, 0);
> 		unlock_page(page);
> 		return 0; /* don't care */
> 	}
> 
> i.e. pages beyond EOF get invalidated.  If it somehow gets through
> that check, __block_write_full_page() will avoid writing dirty
> bufferheads beyond EOF because the write is "racing with truncate".

	Your contention is that we've never gotten those tail blocks to
disk.  Instead, our code either handles the future extensions of i_size
or we've just gotten lucky with our testing.  Our current BUG trigger is
because we have a new check that catches this case.  Does that summarize
your position correctly?
	I'm not averse to having a zero-only-till-i_size policy, but I
know we've visited this problem before and got bit.  I have to go reload
that context.
	Regarding XFS, how do you handle catching the tail of an
allocation with an lseek(2)'d write?  That is, your current allocation
has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
and write there.  The code has to recognize to zero around old_i_size
before moving out to new_i_size, right?  I think that's where our old
approaches had problems.

Joel

-- 

"The real reason GNU ls is 8-bit-clean is so that they can
 start using ISO-8859-1 option characters."
	- Christopher Davis (ckd@loiosh.kei.com)

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

WARNING: multiple messages have this Message-ID (diff)
From: Joel Becker <Joel.Becker@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	ocfs2-devel@oss.oracle.com, Tao Ma <tao.ma@oracle.com>,
	Dave Chinner <dchinner@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com>
Subject: [Ocfs2-devel] [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF"
Date: Mon, 28 Jun 2010 17:54:04 -0700	[thread overview]
Message-ID: <20100629005403.GC24343@mail.oracle.com> (raw)
In-Reply-To: <20100629002421.GY6590@dastard>

On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote:
> On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote:
> > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c.
> 
> Hi Joel,
> 
> I have no problems with it being reverted - it's a really just a
> WAR for the simplest case of the sync hold holdoff.

	I have to insist that we revert it until we find a way to make
ocfs2 work.  The rest of the email will discuss the ocfs2 issues
therein.

> > This patch causes any filesystem with an allocation unit larger than the
> > filesystem blocksize will leak unzeroed data. During a file extend, the
> > entire allocation unit is zeroed.
> 
> XFS has this same underlying issue - it can have uninitialised,
> allocated blocks past EOF that have to be zeroed when extending the
> file.

	Does XFS do this in get_blocks()?  We deliberately do no
allocation in get_blocks(), which is where our need for up-front
allocation comes from.

> > However, this patch prevents the tail
> > blocks of the allocation unit from being written back to disk.  When the
> > file is next extended, i_size will now cover these unzeroed blocks,
> > leaking the old contents of the disk to userspace and creating a corrupt
> > file.
> 
> XFS doesn't zero blocks at allocation. Instead, XFS zeros the range
> between the old EOF and the new EOF on each extending write. Hence
> these pages get written because they fall inside the new i_size that
> is set during the write.  The i_size on disk doesn't get changed
> until after the data writes have completed, so even on a crash we
> don't expose uninitialised blocks.

	We do the same, but we zero the entire allocation.  This works
both when filling holes and when extending, though obviously the
extending is what we're worried about here.  We change i_size in
write_end, so our guarantee is the same as yours for the page containing
i_size.

> Looking at ocfs2_writepage(), it simply calls
> block_write_full_page(), which does:
> 
> 	/* Is the page fully outside i_size? (truncate in progress) */
> 	offset = i_size & (PAGE_CACHE_SIZE-1);
> 	if (page->index >= end_index+1 || !offset) {
> 		/*
> 		 * The page may have dirty, unmapped buffers.  For example,
> 		 * they may have been added in ext3_writepage().  Make them
> 		 * freeable here, so the page does not leak.
> 		 */
> 		do_invalidatepage(page, 0);
> 		unlock_page(page);
> 		return 0; /* don't care */
> 	}
> 
> i.e. pages beyond EOF get invalidated.  If it somehow gets through
> that check, __block_write_full_page() will avoid writing dirty
> bufferheads beyond EOF because the write is "racing with truncate".

	Your contention is that we've never gotten those tail blocks to
disk.  Instead, our code either handles the future extensions of i_size
or we've just gotten lucky with our testing.  Our current BUG trigger is
because we have a new check that catches this case.  Does that summarize
your position correctly?
	I'm not averse to having a zero-only-till-i_size policy, but I
know we've visited this problem before and got bit.  I have to go reload
that context.
	Regarding XFS, how do you handle catching the tail of an
allocation with an lseek(2)'d write?  That is, your current allocation
has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF
and write there.  The code has to recognize to zero around old_i_size
before moving out to new_i_size, right?  I think that's where our old
approaches had problems.

Joel

-- 

"The real reason GNU ls is 8-bit-clean is so that they can
 start using ISO-8859-1 option characters."
	- Christopher Davis (ckd at loiosh.kei.com)

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

  reply	other threads:[~2010-06-29  0:56 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-28 17:35 [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF" Joel Becker
2010-06-28 17:35 ` [Ocfs2-devel] " Joel Becker
2010-06-29  0:24 ` Dave Chinner
2010-06-29  0:24   ` [Ocfs2-devel] " Dave Chinner
2010-06-29  0:54   ` Joel Becker [this message]
2010-06-29  0:54     ` Joel Becker
2010-06-29  1:12     ` Linus Torvalds
2010-06-29  1:12       ` [Ocfs2-devel] " Linus Torvalds
2010-06-29  1:58       ` Joel Becker
2010-06-29  1:58         ` Joel Becker
2010-06-29  2:20         ` Linus Torvalds
2010-06-29  2:20           ` Linus Torvalds
2010-06-29  2:44           ` Dave Chinner
2010-06-29  2:44             ` Dave Chinner
2010-06-29  8:16           ` Joel Becker
2010-06-29  8:16             ` Joel Becker
2010-06-30  1:30             ` Joel Becker
2010-06-30  1:30               ` Joel Becker
2010-07-06 19:06         ` Joel Becker
2010-07-06 19:06           ` Joel Becker
2010-06-29  1:56     ` Dave Chinner
2010-06-29  1:56       ` [Ocfs2-devel] " Dave Chinner
2010-06-29  2:04       ` Joel Becker
2010-06-29  2:04         ` [Ocfs2-devel] " Joel Becker
2010-06-29  2:27         ` Dave Chinner
2010-06-29  2:27           ` [Ocfs2-devel] " Dave Chinner
2010-06-29  7:18           ` Joel Becker
2010-06-29  7:18             ` [Ocfs2-devel] " Joel Becker
2010-07-02 22:49             ` [PATCH] ocfs2: Zero the tail cluster when extending past i_size Joel Becker
2010-07-02 22:49               ` [Ocfs2-devel] " Joel Becker
2010-07-03 21:32               ` [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2 Joel Becker
2010-07-03 21:32                 ` [Ocfs2-devel] " Joel Becker
2010-07-03 21:33                 ` [PATCH 2/2] ocfs2: No need to zero pages past i_size. " Joel Becker
2010-07-03 21:33                   ` [Ocfs2-devel] " Joel Becker
2010-07-04 15:13                   ` Tao Ma
2010-07-04 15:13                     ` [Ocfs2-devel] " Tao Ma
2010-07-05  1:38                     ` Tao Ma
2010-07-05  1:38                       ` [Ocfs2-devel] " Tao Ma
2010-07-06  7:10                       ` Joel Becker
2010-07-06  7:10                         ` [Ocfs2-devel] " Joel Becker
2010-07-06  7:09                     ` Joel Becker
2010-07-06  7:09                       ` [Ocfs2-devel] " Joel Becker
2010-07-06 18:39                       ` Joel Becker
2010-07-06 18:39                         ` Joel Becker
2010-07-05  3:51                 ` [PATCH 1/2] ocfs2: Zero the tail cluster when extending past " Tao Ma
2010-07-05  3:51                   ` [Ocfs2-devel] " Tao Ma
2010-07-06  7:17                   ` Joel Becker
2010-07-06  7:17                     ` [Ocfs2-devel] " Joel Becker
2010-07-06  7:54                     ` Tao Ma
2010-07-06  7:54                       ` [Ocfs2-devel] " Tao Ma
2010-07-06 11:58                       ` Joel Becker
2010-07-06 11:58                         ` [Ocfs2-devel] " Joel Becker
2010-07-07  0:42                         ` Tao Ma
2010-07-07  0:42                           ` [Ocfs2-devel] " Tao Ma
2010-07-07  2:03                           ` Joel Becker
2010-07-07  2:03                             ` [Ocfs2-devel] " Joel Becker
2010-07-06 18:48                   ` Joel Becker
2010-07-06 18:48                     ` [Ocfs2-devel] " Joel Becker
2010-07-06 18:57                   ` Joel Becker
2010-07-06 18:57                     ` [Ocfs2-devel] " Joel Becker
2010-07-07 11:16                 ` [PATCH 0/3] ocfs2: Tail zeroing fixes Joel Becker
2010-07-07 11:16                   ` [Ocfs2-devel] " Joel Becker
2010-07-12 22:45                   ` Joel Becker
2010-07-12 22:45                     ` Joel Becker
2010-07-07 11:16                 ` [PATCH 1/3] ocfs2: When zero extending, do it by page Joel Becker
2010-07-07 11:16                   ` [Ocfs2-devel] " Joel Becker
2010-07-07 15:19                   ` Tao Ma
2010-07-07 15:19                     ` [Ocfs2-devel] " Tao Ma
2010-07-07 20:04                     ` Joel Becker
2010-07-07 20:04                       ` [Ocfs2-devel] " Joel Becker
2010-07-08  3:44                   ` Tao Ma
2010-07-08  3:44                     ` [Ocfs2-devel] " Tao Ma
2010-07-08  9:51                     ` Joel Becker
2010-07-08  9:51                       ` [Ocfs2-devel] " Joel Becker
2010-07-07 11:16                 ` [PATCH 2/3] ocfs2: Zero the tail cluster when extending past i_size Joel Becker
2010-07-07 11:16                   ` [Ocfs2-devel] " Joel Becker
2010-07-07 11:16                 ` [PATCH 3/3] ocfs2: No need to zero pages " Joel Becker
2010-07-07 11:16                   ` [Ocfs2-devel] " Joel Becker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100629005403.GC24343@mail.oracle.com \
    --to=joel.becker@oracle.com \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=hch@lst.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mfasheh@suse.com \
    --cc=ocfs2-devel@oss.oracle.com \
    --cc=tao.ma@oracle.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.