From: Tao Ma <tao.ma@oracle.com> To: Dave Chinner <david@fromorbit.com>, Linus Torvalds <torvalds@linux-foundation.org>, Linux Kernel <linux-kernel@vger.kernel.org>, ocfs2-devel@oss.oracle.com, Dave Chinner <dchinner@redhat.com>, Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com>, Joel Becker <joel.becker@oracle.com> Subject: Re: [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2 Date: Mon, 05 Jul 2010 11:51:44 +0800 [thread overview] Message-ID: <4C3156D0.5090001@oracle.com> (raw) In-Reply-To: <20100703213219.GB21262@mail.oracle.com> Hi Joel, On 07/04/2010 05:32 AM, Joel Becker wrote: > Here's the second version of my corruption fix. It fixes two > bugs: > > 1) i_size can obviously be at a place that is a hole, so don't BUG on > that. > 2) Fix an off-by-one when checking whether the write position is within > the tail allocation. > > This version passes my tail corruption test as well as the kernel > compile that exposed the two bugs above. > > Joel > > --------------------------------------------------------------- > > ocfs2's allocation unit is the cluster. This can be larger than a block > or even a memory page. This means that a file may have many blocks in > its last extent that are beyond the block containing i_size. > > When ocfs2 grows a file, it zeros the entire cluster in order to ensure > future i_size growth will see cleared blocks. Unfortunately, > block_write_full_page() drops the pages past i_size. This means that > ocfs2 is actually leaking garbage data into the tail end of that last > cluster. > > We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect > when a write or truncate is past i_size. If there is any existing > allocation between the block containing the current i_size and the > location of the write or truncate, zeros will be written to that > allocation. > > This is only for sparse filesystems. Non-sparse filesystems already get > this via ocfs2_extend_no_holes(). > > Signed-off-by: Joel Becker<joel.becker@oracle.com> > --- > fs/ocfs2/aops.c | 22 ++++---- > fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------ > fs/ocfs2/file.h | 2 + > 3 files changed, 150 insertions(+), 28 deletions(-) > <snip> > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c > index 6a13ea6..7fca78d 100644 > --- a/fs/ocfs2/file.c > +++ b/fs/ocfs2/file.c > @@ -848,6 +848,137 @@ out: > return ret; > } > > +/* > + * This function is a helper for ocfs2_zero_tail(). It calculates > + * what blocks need zeroing and does any CoW necessary. > + */ > +static int ocfs2_zero_tail_prepare(struct inode *inode, > + struct buffer_head *di_bh, > + loff_t pos, u64 *start_blkno, > + u64 *blocks) > +{ > + int rc = 0; > + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); > + u32 tail_cpos, pos_cpos, p_cpos; > + u64 tail_blkno, pos_blkno, blocks_to_zero; > + unsigned int num_clusters = 0; > + unsigned int ext_flags = 0; > + > + /* > + * The block containing i_size has already been zeroed, so our tail > + * block is the first block after i_size. The block containing > + * pos will be zeroed. So we only need to do anything if > + * tail_blkno is before pos_blkno. > + */ > + tail_blkno = (i_size_read(inode)>> inode->i_sb->s_blocksize_bits) + 1; > + pos_blkno = pos>> inode->i_sb->s_blocksize_bits; > + mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n", > + (unsigned long long)tail_blkno, (unsigned long long)pos_blkno); > + if (pos_blkno<= tail_blkno) > + goto out; > + blocks_to_zero = pos_blkno - tail_blkno; > + > + /* > + * If tail_blkno is in the cluster past i_size, we don't need > + * to touch the cluster containing i_size at all. > + */ > + tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits; > + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos) > + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, > + tail_blkno); Can we always set tail_cpos in one line? tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)? tail_cpos is either the same cluster as i_size or the next cluster and both works for tail_blkno I guess? > + > + rc = ocfs2_get_clusters(inode, tail_cpos,&p_cpos,&num_clusters, > + &ext_flags); > + if (rc) { > + mlog_errno(rc); > + goto out; > + } > + > + /* Is there a cluster to zero? */ > + if (!p_cpos) > + goto out; For unwritten extent, we also need to clear the pages? If yes, the solution doesn't complete if we have 2 unwritten extent, one contains i_size while one passes i_size. Here we only clear the pages for the 1st unwritten extent and leave the 2nd one untouched. > + > + pos_cpos = pos>> osb->s_clustersize_bits; > + mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n", > + (unsigned int)tail_cpos, (unsigned int)num_clusters, > + (unsigned int)pos_cpos, (unsigned long long)tail_blkno, > + (unsigned long long)pos_blkno); From here to the call of CoW is a bit hard to understand. In 'if', num_clusters is set for CoW and in 'else', blocks_to_zero is set. So it isn't easy for the reader to tell why these 2 clauses are setting different values. So how about my code below? It looks more straightforward I think. > + if ((tail_cpos + num_clusters)> pos_cpos) { > + num_clusters = pos_cpos - tail_cpos; > + if (pos_blkno> > + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos)) > + num_clusters += 1; > + } else { > + blocks_to_zero = > + ocfs2_clusters_to_blocks(inode->i_sb, > + tail_cpos + num_clusters); > + blocks_to_zero -= tail_blkno; > + } > + > + /* Now CoW the clusters we're about to zero */ > + if (ext_flags& OCFS2_EXT_REFCOUNTED) { > + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos, > + num_clusters, UINT_MAX); > + if (rc) { > + mlog_errno(rc); > + goto out; > + } > + } /* Decrease blocks_to_zero if there is some hole after extent */ if (tail_cpos + num_clusters <= pos_cpos) { blocks_to_zero = ocfs2_clusters_to_blocks(inode->i_sb, tail_cpos + num_clusters); blocks_to_zero -= tail_blkno; } /* Now CoW if we have some refcounted clusters. */ if (ext_flags & OCFS2_EXT_REFCOUNTED) { /* * We add one more cluster here since it will be * written shortly and if the pos_blkno isn't aligned * to the cluster size, we have to zero the blocks * before it. */ if (tail_cpos + num_clusters > pos_cpos) num_clusters = pos_cpos - tail_cpos + 1; rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos, num_clusters, UINT_MAX); if (rc) { mlog_errno(rc); goto out; } } > + > + *start_blkno = tail_blkno; > + *blocks = blocks_to_zero; > + mlog(0, "start_blkno = %llu, blocks = %llu\n", > + (unsigned long long)(*start_blkno), > + (unsigned long long)(*blocks)); > + > +out: > + return rc; > +} Regards, Tao
WARNING: multiple messages have this Message-ID (diff)
From: Tao Ma <tao.ma@oracle.com> To: Dave Chinner <david@fromorbit.com>, Linus Torvalds <torvalds@linux-foundation.org>, Linux Kernel <linux-kernel@vger.kernel.org>, ocfs2-devel@oss.oracle.com, Dave Chinner <dchinner@redhat.com>, Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com>, Joel Becker <joel.becker@oracle.com> Subject: [Ocfs2-devel] [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2 Date: Mon, 05 Jul 2010 11:51:44 +0800 [thread overview] Message-ID: <4C3156D0.5090001@oracle.com> (raw) In-Reply-To: <20100703213219.GB21262@mail.oracle.com> Hi Joel, On 07/04/2010 05:32 AM, Joel Becker wrote: > Here's the second version of my corruption fix. It fixes two > bugs: > > 1) i_size can obviously be at a place that is a hole, so don't BUG on > that. > 2) Fix an off-by-one when checking whether the write position is within > the tail allocation. > > This version passes my tail corruption test as well as the kernel > compile that exposed the two bugs above. > > Joel > > --------------------------------------------------------------- > > ocfs2's allocation unit is the cluster. This can be larger than a block > or even a memory page. This means that a file may have many blocks in > its last extent that are beyond the block containing i_size. > > When ocfs2 grows a file, it zeros the entire cluster in order to ensure > future i_size growth will see cleared blocks. Unfortunately, > block_write_full_page() drops the pages past i_size. This means that > ocfs2 is actually leaking garbage data into the tail end of that last > cluster. > > We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect > when a write or truncate is past i_size. If there is any existing > allocation between the block containing the current i_size and the > location of the write or truncate, zeros will be written to that > allocation. > > This is only for sparse filesystems. Non-sparse filesystems already get > this via ocfs2_extend_no_holes(). > > Signed-off-by: Joel Becker<joel.becker@oracle.com> > --- > fs/ocfs2/aops.c | 22 ++++---- > fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------ > fs/ocfs2/file.h | 2 + > 3 files changed, 150 insertions(+), 28 deletions(-) > <snip> > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c > index 6a13ea6..7fca78d 100644 > --- a/fs/ocfs2/file.c > +++ b/fs/ocfs2/file.c > @@ -848,6 +848,137 @@ out: > return ret; > } > > +/* > + * This function is a helper for ocfs2_zero_tail(). It calculates > + * what blocks need zeroing and does any CoW necessary. > + */ > +static int ocfs2_zero_tail_prepare(struct inode *inode, > + struct buffer_head *di_bh, > + loff_t pos, u64 *start_blkno, > + u64 *blocks) > +{ > + int rc = 0; > + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); > + u32 tail_cpos, pos_cpos, p_cpos; > + u64 tail_blkno, pos_blkno, blocks_to_zero; > + unsigned int num_clusters = 0; > + unsigned int ext_flags = 0; > + > + /* > + * The block containing i_size has already been zeroed, so our tail > + * block is the first block after i_size. The block containing > + * pos will be zeroed. So we only need to do anything if > + * tail_blkno is before pos_blkno. > + */ > + tail_blkno = (i_size_read(inode)>> inode->i_sb->s_blocksize_bits) + 1; > + pos_blkno = pos>> inode->i_sb->s_blocksize_bits; > + mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n", > + (unsigned long long)tail_blkno, (unsigned long long)pos_blkno); > + if (pos_blkno<= tail_blkno) > + goto out; > + blocks_to_zero = pos_blkno - tail_blkno; > + > + /* > + * If tail_blkno is in the cluster past i_size, we don't need > + * to touch the cluster containing i_size at all. > + */ > + tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits; > + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos) > + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, > + tail_blkno); Can we always set tail_cpos in one line? tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)? tail_cpos is either the same cluster as i_size or the next cluster and both works for tail_blkno I guess? > + > + rc = ocfs2_get_clusters(inode, tail_cpos,&p_cpos,&num_clusters, > + &ext_flags); > + if (rc) { > + mlog_errno(rc); > + goto out; > + } > + > + /* Is there a cluster to zero? */ > + if (!p_cpos) > + goto out; For unwritten extent, we also need to clear the pages? If yes, the solution doesn't complete if we have 2 unwritten extent, one contains i_size while one passes i_size. Here we only clear the pages for the 1st unwritten extent and leave the 2nd one untouched. > + > + pos_cpos = pos>> osb->s_clustersize_bits; > + mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n", > + (unsigned int)tail_cpos, (unsigned int)num_clusters, > + (unsigned int)pos_cpos, (unsigned long long)tail_blkno, > + (unsigned long long)pos_blkno); From here to the call of CoW is a bit hard to understand. In 'if', num_clusters is set for CoW and in 'else', blocks_to_zero is set. So it isn't easy for the reader to tell why these 2 clauses are setting different values. So how about my code below? It looks more straightforward I think. > + if ((tail_cpos + num_clusters)> pos_cpos) { > + num_clusters = pos_cpos - tail_cpos; > + if (pos_blkno> > + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos)) > + num_clusters += 1; > + } else { > + blocks_to_zero = > + ocfs2_clusters_to_blocks(inode->i_sb, > + tail_cpos + num_clusters); > + blocks_to_zero -= tail_blkno; > + } > + > + /* Now CoW the clusters we're about to zero */ > + if (ext_flags& OCFS2_EXT_REFCOUNTED) { > + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos, > + num_clusters, UINT_MAX); > + if (rc) { > + mlog_errno(rc); > + goto out; > + } > + } /* Decrease blocks_to_zero if there is some hole after extent */ if (tail_cpos + num_clusters <= pos_cpos) { blocks_to_zero = ocfs2_clusters_to_blocks(inode->i_sb, tail_cpos + num_clusters); blocks_to_zero -= tail_blkno; } /* Now CoW if we have some refcounted clusters. */ if (ext_flags & OCFS2_EXT_REFCOUNTED) { /* * We add one more cluster here since it will be * written shortly and if the pos_blkno isn't aligned * to the cluster size, we have to zero the blocks * before it. */ if (tail_cpos + num_clusters > pos_cpos) num_clusters = pos_cpos - tail_cpos + 1; rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos, num_clusters, UINT_MAX); if (rc) { mlog_errno(rc); goto out; } } > + > + *start_blkno = tail_blkno; > + *blocks = blocks_to_zero; > + mlog(0, "start_blkno = %llu, blocks = %llu\n", > + (unsigned long long)(*start_blkno), > + (unsigned long long)(*blocks)); > + > +out: > + return rc; > +} Regards, Tao
next prev parent reply other threads:[~2010-07-05 3:53 UTC|newest] Thread overview: 78+ messages / expand[flat|nested] mbox.gz Atom feed top 2010-06-28 17:35 [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF" Joel Becker 2010-06-28 17:35 ` [Ocfs2-devel] " Joel Becker 2010-06-29 0:24 ` Dave Chinner 2010-06-29 0:24 ` [Ocfs2-devel] " Dave Chinner 2010-06-29 0:54 ` Joel Becker 2010-06-29 0:54 ` [Ocfs2-devel] " Joel Becker 2010-06-29 1:12 ` Linus Torvalds 2010-06-29 1:12 ` [Ocfs2-devel] " Linus Torvalds 2010-06-29 1:58 ` Joel Becker 2010-06-29 1:58 ` Joel Becker 2010-06-29 2:20 ` Linus Torvalds 2010-06-29 2:20 ` Linus Torvalds 2010-06-29 2:44 ` Dave Chinner 2010-06-29 2:44 ` Dave Chinner 2010-06-29 8:16 ` Joel Becker 2010-06-29 8:16 ` Joel Becker 2010-06-30 1:30 ` Joel Becker 2010-06-30 1:30 ` Joel Becker 2010-07-06 19:06 ` Joel Becker 2010-07-06 19:06 ` Joel Becker 2010-06-29 1:56 ` Dave Chinner 2010-06-29 1:56 ` [Ocfs2-devel] " Dave Chinner 2010-06-29 2:04 ` Joel Becker 2010-06-29 2:04 ` [Ocfs2-devel] " Joel Becker 2010-06-29 2:27 ` Dave Chinner 2010-06-29 2:27 ` [Ocfs2-devel] " Dave Chinner 2010-06-29 7:18 ` Joel Becker 2010-06-29 7:18 ` [Ocfs2-devel] " Joel Becker 2010-07-02 22:49 ` [PATCH] ocfs2: Zero the tail cluster when extending past i_size Joel Becker 2010-07-02 22:49 ` [Ocfs2-devel] " Joel Becker 2010-07-03 21:32 ` [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2 Joel Becker 2010-07-03 21:32 ` [Ocfs2-devel] " Joel Becker 2010-07-03 21:33 ` [PATCH 2/2] ocfs2: No need to zero pages past i_size. " Joel Becker 2010-07-03 21:33 ` [Ocfs2-devel] " Joel Becker 2010-07-04 15:13 ` Tao Ma 2010-07-04 15:13 ` [Ocfs2-devel] " Tao Ma 2010-07-05 1:38 ` Tao Ma 2010-07-05 1:38 ` [Ocfs2-devel] " Tao Ma 2010-07-06 7:10 ` Joel Becker 2010-07-06 7:10 ` [Ocfs2-devel] " Joel Becker 2010-07-06 7:09 ` Joel Becker 2010-07-06 7:09 ` [Ocfs2-devel] " Joel Becker 2010-07-06 18:39 ` Joel Becker 2010-07-06 18:39 ` Joel Becker 2010-07-05 3:51 ` Tao Ma [this message] 2010-07-05 3:51 ` [Ocfs2-devel] [PATCH 1/2] ocfs2: Zero the tail cluster when extending past " Tao Ma 2010-07-06 7:17 ` Joel Becker 2010-07-06 7:17 ` [Ocfs2-devel] " Joel Becker 2010-07-06 7:54 ` Tao Ma 2010-07-06 7:54 ` [Ocfs2-devel] " Tao Ma 2010-07-06 11:58 ` Joel Becker 2010-07-06 11:58 ` [Ocfs2-devel] " Joel Becker 2010-07-07 0:42 ` Tao Ma 2010-07-07 0:42 ` [Ocfs2-devel] " Tao Ma 2010-07-07 2:03 ` Joel Becker 2010-07-07 2:03 ` [Ocfs2-devel] " Joel Becker 2010-07-06 18:48 ` Joel Becker 2010-07-06 18:48 ` [Ocfs2-devel] " Joel Becker 2010-07-06 18:57 ` Joel Becker 2010-07-06 18:57 ` [Ocfs2-devel] " Joel Becker 2010-07-07 11:16 ` [PATCH 0/3] ocfs2: Tail zeroing fixes Joel Becker 2010-07-07 11:16 ` [Ocfs2-devel] " Joel Becker 2010-07-12 22:45 ` Joel Becker 2010-07-12 22:45 ` Joel Becker 2010-07-07 11:16 ` [PATCH 1/3] ocfs2: When zero extending, do it by page Joel Becker 2010-07-07 11:16 ` [Ocfs2-devel] " Joel Becker 2010-07-07 15:19 ` Tao Ma 2010-07-07 15:19 ` [Ocfs2-devel] " Tao Ma 2010-07-07 20:04 ` Joel Becker 2010-07-07 20:04 ` [Ocfs2-devel] " Joel Becker 2010-07-08 3:44 ` Tao Ma 2010-07-08 3:44 ` [Ocfs2-devel] " Tao Ma 2010-07-08 9:51 ` Joel Becker 2010-07-08 9:51 ` [Ocfs2-devel] " Joel Becker 2010-07-07 11:16 ` [PATCH 2/3] ocfs2: Zero the tail cluster when extending past i_size Joel Becker 2010-07-07 11:16 ` [Ocfs2-devel] " Joel Becker 2010-07-07 11:16 ` [PATCH 3/3] ocfs2: No need to zero pages " Joel Becker 2010-07-07 11:16 ` [Ocfs2-devel] " Joel Becker
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=4C3156D0.5090001@oracle.com \ --to=tao.ma@oracle.com \ --cc=david@fromorbit.com \ --cc=dchinner@redhat.com \ --cc=hch@lst.de \ --cc=joel.becker@oracle.com \ --cc=linux-kernel@vger.kernel.org \ --cc=mfasheh@suse.com \ --cc=ocfs2-devel@oss.oracle.com \ --cc=torvalds@linux-foundation.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.