From: Joel Becker <Joel.Becker@oracle.com> To: Dave Chinner <david@fromorbit.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>, Linux Kernel <linux-kernel@vger.kernel.org>, ocfs2-devel@oss.oracle.com, Tao Ma <tao.ma@oracle.com>, Dave Chinner <dchinner@redhat.com>, Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com> Subject: Re: [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF" Date: Mon, 28 Jun 2010 17:54:04 -0700 [thread overview] Message-ID: <20100629005403.GC24343@mail.oracle.com> (raw) In-Reply-To: <20100629002421.GY6590@dastard> On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote: > On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote: > > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c. > > Hi Joel, > > I have no problems with it being reverted - it's a really just a > WAR for the simplest case of the sync hold holdoff. I have to insist that we revert it until we find a way to make ocfs2 work. The rest of the email will discuss the ocfs2 issues therein. > > This patch causes any filesystem with an allocation unit larger than the > > filesystem blocksize will leak unzeroed data. During a file extend, the > > entire allocation unit is zeroed. > > XFS has this same underlying issue - it can have uninitialised, > allocated blocks past EOF that have to be zeroed when extending the > file. Does XFS do this in get_blocks()? We deliberately do no allocation in get_blocks(), which is where our need for up-front allocation comes from. > > However, this patch prevents the tail > > blocks of the allocation unit from being written back to disk. When the > > file is next extended, i_size will now cover these unzeroed blocks, > > leaking the old contents of the disk to userspace and creating a corrupt > > file. > > XFS doesn't zero blocks at allocation. Instead, XFS zeros the range > between the old EOF and the new EOF on each extending write. Hence > these pages get written because they fall inside the new i_size that > is set during the write. The i_size on disk doesn't get changed > until after the data writes have completed, so even on a crash we > don't expose uninitialised blocks. We do the same, but we zero the entire allocation. This works both when filling holes and when extending, though obviously the extending is what we're worried about here. We change i_size in write_end, so our guarantee is the same as yours for the page containing i_size. > Looking at ocfs2_writepage(), it simply calls > block_write_full_page(), which does: > > /* Is the page fully outside i_size? (truncate in progress) */ > offset = i_size & (PAGE_CACHE_SIZE-1); > if (page->index >= end_index+1 || !offset) { > /* > * The page may have dirty, unmapped buffers. For example, > * they may have been added in ext3_writepage(). Make them > * freeable here, so the page does not leak. > */ > do_invalidatepage(page, 0); > unlock_page(page); > return 0; /* don't care */ > } > > i.e. pages beyond EOF get invalidated. If it somehow gets through > that check, __block_write_full_page() will avoid writing dirty > bufferheads beyond EOF because the write is "racing with truncate". Your contention is that we've never gotten those tail blocks to disk. Instead, our code either handles the future extensions of i_size or we've just gotten lucky with our testing. Our current BUG trigger is because we have a new check that catches this case. Does that summarize your position correctly? I'm not averse to having a zero-only-till-i_size policy, but I know we've visited this problem before and got bit. I have to go reload that context. Regarding XFS, how do you handle catching the tail of an allocation with an lseek(2)'d write? That is, your current allocation has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF and write there. The code has to recognize to zero around old_i_size before moving out to new_i_size, right? I think that's where our old approaches had problems. Joel -- "The real reason GNU ls is 8-bit-clean is so that they can start using ISO-8859-1 option characters." - Christopher Davis (ckd@loiosh.kei.com) Joel Becker Consulting Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127
WARNING: multiple messages have this Message-ID (diff)
From: Joel Becker <Joel.Becker@oracle.com> To: Dave Chinner <david@fromorbit.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>, Linux Kernel <linux-kernel@vger.kernel.org>, ocfs2-devel@oss.oracle.com, Tao Ma <tao.ma@oracle.com>, Dave Chinner <dchinner@redhat.com>, Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com> Subject: [Ocfs2-devel] [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF" Date: Mon, 28 Jun 2010 17:54:04 -0700 [thread overview] Message-ID: <20100629005403.GC24343@mail.oracle.com> (raw) In-Reply-To: <20100629002421.GY6590@dastard> On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote: > On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote: > > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c. > > Hi Joel, > > I have no problems with it being reverted - it's a really just a > WAR for the simplest case of the sync hold holdoff. I have to insist that we revert it until we find a way to make ocfs2 work. The rest of the email will discuss the ocfs2 issues therein. > > This patch causes any filesystem with an allocation unit larger than the > > filesystem blocksize will leak unzeroed data. During a file extend, the > > entire allocation unit is zeroed. > > XFS has this same underlying issue - it can have uninitialised, > allocated blocks past EOF that have to be zeroed when extending the > file. Does XFS do this in get_blocks()? We deliberately do no allocation in get_blocks(), which is where our need for up-front allocation comes from. > > However, this patch prevents the tail > > blocks of the allocation unit from being written back to disk. When the > > file is next extended, i_size will now cover these unzeroed blocks, > > leaking the old contents of the disk to userspace and creating a corrupt > > file. > > XFS doesn't zero blocks at allocation. Instead, XFS zeros the range > between the old EOF and the new EOF on each extending write. Hence > these pages get written because they fall inside the new i_size that > is set during the write. The i_size on disk doesn't get changed > until after the data writes have completed, so even on a crash we > don't expose uninitialised blocks. We do the same, but we zero the entire allocation. This works both when filling holes and when extending, though obviously the extending is what we're worried about here. We change i_size in write_end, so our guarantee is the same as yours for the page containing i_size. > Looking at ocfs2_writepage(), it simply calls > block_write_full_page(), which does: > > /* Is the page fully outside i_size? (truncate in progress) */ > offset = i_size & (PAGE_CACHE_SIZE-1); > if (page->index >= end_index+1 || !offset) { > /* > * The page may have dirty, unmapped buffers. For example, > * they may have been added in ext3_writepage(). Make them > * freeable here, so the page does not leak. > */ > do_invalidatepage(page, 0); > unlock_page(page); > return 0; /* don't care */ > } > > i.e. pages beyond EOF get invalidated. If it somehow gets through > that check, __block_write_full_page() will avoid writing dirty > bufferheads beyond EOF because the write is "racing with truncate". Your contention is that we've never gotten those tail blocks to disk. Instead, our code either handles the future extensions of i_size or we've just gotten lucky with our testing. Our current BUG trigger is because we have a new check that catches this case. Does that summarize your position correctly? I'm not averse to having a zero-only-till-i_size policy, but I know we've visited this problem before and got bit. I have to go reload that context. Regarding XFS, how do you handle catching the tail of an allocation with an lseek(2)'d write? That is, your current allocation has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF and write there. The code has to recognize to zero around old_i_size before moving out to new_i_size, right? I think that's where our old approaches had problems. Joel -- "The real reason GNU ls is 8-bit-clean is so that they can start using ISO-8859-1 option characters." - Christopher Davis (ckd at loiosh.kei.com) Joel Becker Consulting Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127
next prev parent reply other threads:[~2010-06-29 0:56 UTC|newest] Thread overview: 78+ messages / expand[flat|nested] mbox.gz Atom feed top 2010-06-28 17:35 [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF" Joel Becker 2010-06-28 17:35 ` [Ocfs2-devel] " Joel Becker 2010-06-29 0:24 ` Dave Chinner 2010-06-29 0:24 ` [Ocfs2-devel] " Dave Chinner 2010-06-29 0:54 ` Joel Becker [this message] 2010-06-29 0:54 ` Joel Becker 2010-06-29 1:12 ` Linus Torvalds 2010-06-29 1:12 ` [Ocfs2-devel] " Linus Torvalds 2010-06-29 1:58 ` Joel Becker 2010-06-29 1:58 ` Joel Becker 2010-06-29 2:20 ` Linus Torvalds 2010-06-29 2:20 ` Linus Torvalds 2010-06-29 2:44 ` Dave Chinner 2010-06-29 2:44 ` Dave Chinner 2010-06-29 8:16 ` Joel Becker 2010-06-29 8:16 ` Joel Becker 2010-06-30 1:30 ` Joel Becker 2010-06-30 1:30 ` Joel Becker 2010-07-06 19:06 ` Joel Becker 2010-07-06 19:06 ` Joel Becker 2010-06-29 1:56 ` Dave Chinner 2010-06-29 1:56 ` [Ocfs2-devel] " Dave Chinner 2010-06-29 2:04 ` Joel Becker 2010-06-29 2:04 ` [Ocfs2-devel] " Joel Becker 2010-06-29 2:27 ` Dave Chinner 2010-06-29 2:27 ` [Ocfs2-devel] " Dave Chinner 2010-06-29 7:18 ` Joel Becker 2010-06-29 7:18 ` [Ocfs2-devel] " Joel Becker 2010-07-02 22:49 ` [PATCH] ocfs2: Zero the tail cluster when extending past i_size Joel Becker 2010-07-02 22:49 ` [Ocfs2-devel] " Joel Becker 2010-07-03 21:32 ` [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2 Joel Becker 2010-07-03 21:32 ` [Ocfs2-devel] " Joel Becker 2010-07-03 21:33 ` [PATCH 2/2] ocfs2: No need to zero pages past i_size. " Joel Becker 2010-07-03 21:33 ` [Ocfs2-devel] " Joel Becker 2010-07-04 15:13 ` Tao Ma 2010-07-04 15:13 ` [Ocfs2-devel] " Tao Ma 2010-07-05 1:38 ` Tao Ma 2010-07-05 1:38 ` [Ocfs2-devel] " Tao Ma 2010-07-06 7:10 ` Joel Becker 2010-07-06 7:10 ` [Ocfs2-devel] " Joel Becker 2010-07-06 7:09 ` Joel Becker 2010-07-06 7:09 ` [Ocfs2-devel] " Joel Becker 2010-07-06 18:39 ` Joel Becker 2010-07-06 18:39 ` Joel Becker 2010-07-05 3:51 ` [PATCH 1/2] ocfs2: Zero the tail cluster when extending past " Tao Ma 2010-07-05 3:51 ` [Ocfs2-devel] " Tao Ma 2010-07-06 7:17 ` Joel Becker 2010-07-06 7:17 ` [Ocfs2-devel] " Joel Becker 2010-07-06 7:54 ` Tao Ma 2010-07-06 7:54 ` [Ocfs2-devel] " Tao Ma 2010-07-06 11:58 ` Joel Becker 2010-07-06 11:58 ` [Ocfs2-devel] " Joel Becker 2010-07-07 0:42 ` Tao Ma 2010-07-07 0:42 ` [Ocfs2-devel] " Tao Ma 2010-07-07 2:03 ` Joel Becker 2010-07-07 2:03 ` [Ocfs2-devel] " Joel Becker 2010-07-06 18:48 ` Joel Becker 2010-07-06 18:48 ` [Ocfs2-devel] " Joel Becker 2010-07-06 18:57 ` Joel Becker 2010-07-06 18:57 ` [Ocfs2-devel] " Joel Becker 2010-07-07 11:16 ` [PATCH 0/3] ocfs2: Tail zeroing fixes Joel Becker 2010-07-07 11:16 ` [Ocfs2-devel] " Joel Becker 2010-07-12 22:45 ` Joel Becker 2010-07-12 22:45 ` Joel Becker 2010-07-07 11:16 ` [PATCH 1/3] ocfs2: When zero extending, do it by page Joel Becker 2010-07-07 11:16 ` [Ocfs2-devel] " Joel Becker 2010-07-07 15:19 ` Tao Ma 2010-07-07 15:19 ` [Ocfs2-devel] " Tao Ma 2010-07-07 20:04 ` Joel Becker 2010-07-07 20:04 ` [Ocfs2-devel] " Joel Becker 2010-07-08 3:44 ` Tao Ma 2010-07-08 3:44 ` [Ocfs2-devel] " Tao Ma 2010-07-08 9:51 ` Joel Becker 2010-07-08 9:51 ` [Ocfs2-devel] " Joel Becker 2010-07-07 11:16 ` [PATCH 2/3] ocfs2: Zero the tail cluster when extending past i_size Joel Becker 2010-07-07 11:16 ` [Ocfs2-devel] " Joel Becker 2010-07-07 11:16 ` [PATCH 3/3] ocfs2: No need to zero pages " Joel Becker 2010-07-07 11:16 ` [Ocfs2-devel] " Joel Becker
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20100629005403.GC24343@mail.oracle.com \ --to=joel.becker@oracle.com \ --cc=david@fromorbit.com \ --cc=dchinner@redhat.com \ --cc=hch@lst.de \ --cc=linux-kernel@vger.kernel.org \ --cc=mfasheh@suse.com \ --cc=ocfs2-devel@oss.oracle.com \ --cc=tao.ma@oracle.com \ --cc=torvalds@linux-foundation.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.