All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tao Ma <tao.ma@oracle.com>
To: Dave Chinner <david@fromorbit.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	ocfs2-devel@oss.oracle.com, Dave Chinner <dchinner@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com>,
	Joel Becker <joel.becker@oracle.com>
Subject: Re: [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2
Date: Mon, 05 Jul 2010 11:51:44 +0800	[thread overview]
Message-ID: <4C3156D0.5090001@oracle.com> (raw)
In-Reply-To: <20100703213219.GB21262@mail.oracle.com>

Hi Joel,

On 07/04/2010 05:32 AM, Joel Becker wrote:
> 	Here's the second version of my corruption fix.  It fixes two
> bugs:
>
> 1) i_size can obviously be at a place that is a hole, so don't BUG on
>     that.
> 2) Fix an off-by-one when checking whether the write position is within
>     the tail allocation.
>
> This version passes my tail corruption test as well as the kernel
> compile that exposed the two bugs above.
>
> Joel
>
> ---------------------------------------------------------------
>
> ocfs2's allocation unit is the cluster.  This can be larger than a block
> or even a memory page.  This means that a file may have many blocks in
> its last extent that are beyond the block containing i_size.
>
> When ocfs2 grows a file, it zeros the entire cluster in order to ensure
> future i_size growth will see cleared blocks.  Unfortunately,
> block_write_full_page() drops the pages past i_size.  This means that
> ocfs2 is actually leaking garbage data into the tail end of that last
> cluster.
>
> We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
> when a write or truncate is past i_size.  If there is any existing
> allocation between the block containing the current i_size and the
> location of the write or truncate, zeros will be written to that
> allocation.
>
> This is only for sparse filesystems.  Non-sparse filesystems already get
> this via ocfs2_extend_no_holes().
>
> Signed-off-by: Joel Becker<joel.becker@oracle.com>
> ---
>   fs/ocfs2/aops.c |   22 ++++----
>   fs/ocfs2/file.c |  154 +++++++++++++++++++++++++++++++++++++++++++++++++------
>   fs/ocfs2/file.h |    2 +
>   3 files changed, 150 insertions(+), 28 deletions(-)
>
<snip>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6a13ea6..7fca78d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -848,6 +848,137 @@ out:
>   	return ret;
>   }
>
> +/*
> + * This function is a helper for ocfs2_zero_tail().  It calculates
> + * what blocks need zeroing and does any CoW necessary.
> + */
> +static int ocfs2_zero_tail_prepare(struct inode *inode,
> +				   struct buffer_head *di_bh,
> +				   loff_t pos, u64 *start_blkno,
> +				   u64 *blocks)
> +{
> +	int rc = 0;
> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +	u32 tail_cpos, pos_cpos, p_cpos;
> +	u64 tail_blkno, pos_blkno, blocks_to_zero;
> +	unsigned int num_clusters = 0;
> +	unsigned int ext_flags = 0;
> +
> +	/*
> +	 * The block containing i_size has already been zeroed, so our tail
> +	 * block is the first block after i_size.  The block containing
> +	 * pos will be zeroed.  So we only need to do anything if
> +	 * tail_blkno is before pos_blkno.
> +	 */
> +	tail_blkno = (i_size_read(inode)>>  inode->i_sb->s_blocksize_bits) + 1;
> +	pos_blkno = pos>>  inode->i_sb->s_blocksize_bits;
> +	mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
> +	     (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
> +	if (pos_blkno<= tail_blkno)
> +		goto out;
> +	blocks_to_zero = pos_blkno - tail_blkno;
> +
> +	/*
> +	 * If tail_blkno is in the cluster past i_size, we don't need
> +	 * to touch the cluster containing i_size at all.
> +	 */
> +	tail_cpos = i_size_read(inode)>>  osb->s_clustersize_bits;
> +	if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)>  tail_cpos)
> +		tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> +						     tail_blkno);
Can we always set tail_cpos in one line?
	tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
tail_cpos is either the same cluster as i_size or the next cluster and 
both works for tail_blkno I guess?
> +
> +	rc = ocfs2_get_clusters(inode, tail_cpos,&p_cpos,&num_clusters,
> +				&ext_flags);
> +	if (rc) {
> +		mlog_errno(rc);
> +		goto out;
> +	}
> +
> +	/* Is there a cluster to zero? */
> +	if (!p_cpos)
> +		goto out;
For unwritten extent, we also need to clear the pages? If yes, the 
solution doesn't complete if we have 2 unwritten extent, one contains 
i_size while one passes i_size. Here we only clear the pages for the 1st 
unwritten extent and leave the 2nd one untouched.
> +
> +	pos_cpos = pos>>  osb->s_clustersize_bits;
> +	mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n",
> +	     (unsigned int)tail_cpos, (unsigned int)num_clusters,
> +	     (unsigned int)pos_cpos, (unsigned long long)tail_blkno,
> +	     (unsigned long long)pos_blkno);
 From here to the call of CoW is a bit hard to understand. In 'if', 
num_clusters is set for CoW and in 'else', blocks_to_zero is set. So it 
isn't easy for the reader to tell why these 2 clauses are setting 
different values. So how about my code below? It looks more 
straightforward I think.
> +	if ((tail_cpos + num_clusters)>  pos_cpos) {
> +		num_clusters = pos_cpos - tail_cpos;
> +		if (pos_blkno>
> +		    ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
> +			num_clusters += 1;
> +	} else {
> +		blocks_to_zero =
> +			ocfs2_clusters_to_blocks(inode->i_sb,
> +						 tail_cpos + num_clusters);
> +		blocks_to_zero -= tail_blkno;
> +	}
> +
> +	/* Now CoW the clusters we're about to zero */
> +	if (ext_flags&  OCFS2_EXT_REFCOUNTED) {
> +		rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
> +					num_clusters, UINT_MAX);
> +		if (rc) {
> +			mlog_errno(rc);
> +			goto out;
> +		}
> +	}
	/* Decrease blocks_to_zero if there is some hole after extent */
	if (tail_cpos + num_clusters <= pos_cpos) {
		blocks_to_zero =
			ocfs2_clusters_to_blocks(inode->i_sb,
						 tail_cpos + num_clusters);
		blocks_to_zero -= tail_blkno;
	}

	/* Now CoW if we have some refcounted clusters. */
	if (ext_flags & OCFS2_EXT_REFCOUNTED) {
		/*
		 * We add one more cluster here since it will be
		 * written shortly and if the pos_blkno isn't aligned
		 * to the cluster size, we have to zero the blocks
		 * before it.
		 */
		if (tail_cpos + num_clusters > pos_cpos)
			num_clusters = pos_cpos - tail_cpos + 1;

		rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
					num_clusters, UINT_MAX);
		if (rc) {
			mlog_errno(rc);
			goto out;
		}

	}
> +
> +	*start_blkno = tail_blkno;
> +	*blocks = blocks_to_zero;
> +	mlog(0, "start_blkno = %llu, blocks = %llu\n",
> +	     (unsigned long long)(*start_blkno),
> +	     (unsigned long long)(*blocks));
> +
> +out:
> +	return rc;
> +}

Regards,
Tao

WARNING: multiple messages have this Message-ID (diff)
From: Tao Ma <tao.ma@oracle.com>
To: Dave Chinner <david@fromorbit.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	ocfs2-devel@oss.oracle.com, Dave Chinner <dchinner@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Mark Fasheh <mfasheh@suse.com>,
	Joel Becker <joel.becker@oracle.com>
Subject: [Ocfs2-devel] [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2
Date: Mon, 05 Jul 2010 11:51:44 +0800	[thread overview]
Message-ID: <4C3156D0.5090001@oracle.com> (raw)
In-Reply-To: <20100703213219.GB21262@mail.oracle.com>

Hi Joel,

On 07/04/2010 05:32 AM, Joel Becker wrote:
> 	Here's the second version of my corruption fix.  It fixes two
> bugs:
>
> 1) i_size can obviously be at a place that is a hole, so don't BUG on
>     that.
> 2) Fix an off-by-one when checking whether the write position is within
>     the tail allocation.
>
> This version passes my tail corruption test as well as the kernel
> compile that exposed the two bugs above.
>
> Joel
>
> ---------------------------------------------------------------
>
> ocfs2's allocation unit is the cluster.  This can be larger than a block
> or even a memory page.  This means that a file may have many blocks in
> its last extent that are beyond the block containing i_size.
>
> When ocfs2 grows a file, it zeros the entire cluster in order to ensure
> future i_size growth will see cleared blocks.  Unfortunately,
> block_write_full_page() drops the pages past i_size.  This means that
> ocfs2 is actually leaking garbage data into the tail end of that last
> cluster.
>
> We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
> when a write or truncate is past i_size.  If there is any existing
> allocation between the block containing the current i_size and the
> location of the write or truncate, zeros will be written to that
> allocation.
>
> This is only for sparse filesystems.  Non-sparse filesystems already get
> this via ocfs2_extend_no_holes().
>
> Signed-off-by: Joel Becker<joel.becker@oracle.com>
> ---
>   fs/ocfs2/aops.c |   22 ++++----
>   fs/ocfs2/file.c |  154 +++++++++++++++++++++++++++++++++++++++++++++++++------
>   fs/ocfs2/file.h |    2 +
>   3 files changed, 150 insertions(+), 28 deletions(-)
>
<snip>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6a13ea6..7fca78d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -848,6 +848,137 @@ out:
>   	return ret;
>   }
>
> +/*
> + * This function is a helper for ocfs2_zero_tail().  It calculates
> + * what blocks need zeroing and does any CoW necessary.
> + */
> +static int ocfs2_zero_tail_prepare(struct inode *inode,
> +				   struct buffer_head *di_bh,
> +				   loff_t pos, u64 *start_blkno,
> +				   u64 *blocks)
> +{
> +	int rc = 0;
> +	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> +	u32 tail_cpos, pos_cpos, p_cpos;
> +	u64 tail_blkno, pos_blkno, blocks_to_zero;
> +	unsigned int num_clusters = 0;
> +	unsigned int ext_flags = 0;
> +
> +	/*
> +	 * The block containing i_size has already been zeroed, so our tail
> +	 * block is the first block after i_size.  The block containing
> +	 * pos will be zeroed.  So we only need to do anything if
> +	 * tail_blkno is before pos_blkno.
> +	 */
> +	tail_blkno = (i_size_read(inode)>>  inode->i_sb->s_blocksize_bits) + 1;
> +	pos_blkno = pos>>  inode->i_sb->s_blocksize_bits;
> +	mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
> +	     (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
> +	if (pos_blkno<= tail_blkno)
> +		goto out;
> +	blocks_to_zero = pos_blkno - tail_blkno;
> +
> +	/*
> +	 * If tail_blkno is in the cluster past i_size, we don't need
> +	 * to touch the cluster containing i_size at all.
> +	 */
> +	tail_cpos = i_size_read(inode)>>  osb->s_clustersize_bits;
> +	if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)>  tail_cpos)
> +		tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> +						     tail_blkno);
Can we always set tail_cpos in one line?
	tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
tail_cpos is either the same cluster as i_size or the next cluster and 
both works for tail_blkno I guess?
> +
> +	rc = ocfs2_get_clusters(inode, tail_cpos,&p_cpos,&num_clusters,
> +				&ext_flags);
> +	if (rc) {
> +		mlog_errno(rc);
> +		goto out;
> +	}
> +
> +	/* Is there a cluster to zero? */
> +	if (!p_cpos)
> +		goto out;
For unwritten extent, we also need to clear the pages? If yes, the 
solution doesn't complete if we have 2 unwritten extent, one contains 
i_size while one passes i_size. Here we only clear the pages for the 1st 
unwritten extent and leave the 2nd one untouched.
> +
> +	pos_cpos = pos>>  osb->s_clustersize_bits;
> +	mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n",
> +	     (unsigned int)tail_cpos, (unsigned int)num_clusters,
> +	     (unsigned int)pos_cpos, (unsigned long long)tail_blkno,
> +	     (unsigned long long)pos_blkno);
 From here to the call of CoW is a bit hard to understand. In 'if', 
num_clusters is set for CoW and in 'else', blocks_to_zero is set. So it 
isn't easy for the reader to tell why these 2 clauses are setting 
different values. So how about my code below? It looks more 
straightforward I think.
> +	if ((tail_cpos + num_clusters)>  pos_cpos) {
> +		num_clusters = pos_cpos - tail_cpos;
> +		if (pos_blkno>
> +		    ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
> +			num_clusters += 1;
> +	} else {
> +		blocks_to_zero =
> +			ocfs2_clusters_to_blocks(inode->i_sb,
> +						 tail_cpos + num_clusters);
> +		blocks_to_zero -= tail_blkno;
> +	}
> +
> +	/* Now CoW the clusters we're about to zero */
> +	if (ext_flags&  OCFS2_EXT_REFCOUNTED) {
> +		rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
> +					num_clusters, UINT_MAX);
> +		if (rc) {
> +			mlog_errno(rc);
> +			goto out;
> +		}
> +	}
	/* Decrease blocks_to_zero if there is some hole after extent */
	if (tail_cpos + num_clusters <= pos_cpos) {
		blocks_to_zero =
			ocfs2_clusters_to_blocks(inode->i_sb,
						 tail_cpos + num_clusters);
		blocks_to_zero -= tail_blkno;
	}

	/* Now CoW if we have some refcounted clusters. */
	if (ext_flags & OCFS2_EXT_REFCOUNTED) {
		/*
		 * We add one more cluster here since it will be
		 * written shortly and if the pos_blkno isn't aligned
		 * to the cluster size, we have to zero the blocks
		 * before it.
		 */
		if (tail_cpos + num_clusters > pos_cpos)
			num_clusters = pos_cpos - tail_cpos + 1;

		rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
					num_clusters, UINT_MAX);
		if (rc) {
			mlog_errno(rc);
			goto out;
		}

	}
> +
> +	*start_blkno = tail_blkno;
> +	*blocks = blocks_to_zero;
> +	mlog(0, "start_blkno = %llu, blocks = %llu\n",
> +	     (unsigned long long)(*start_blkno),
> +	     (unsigned long long)(*blocks));
> +
> +out:
> +	return rc;
> +}

Regards,
Tao

  parent reply	other threads:[~2010-07-05  3:53 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-28 17:35 [PATCH] Revert "writeback: limit write_cache_pages integrity scanning to current EOF" Joel Becker
2010-06-28 17:35 ` [Ocfs2-devel] " Joel Becker
2010-06-29  0:24 ` Dave Chinner
2010-06-29  0:24   ` [Ocfs2-devel] " Dave Chinner
2010-06-29  0:54   ` Joel Becker
2010-06-29  0:54     ` [Ocfs2-devel] " Joel Becker
2010-06-29  1:12     ` Linus Torvalds
2010-06-29  1:12       ` [Ocfs2-devel] " Linus Torvalds
2010-06-29  1:58       ` Joel Becker
2010-06-29  1:58         ` Joel Becker
2010-06-29  2:20         ` Linus Torvalds
2010-06-29  2:20           ` Linus Torvalds
2010-06-29  2:44           ` Dave Chinner
2010-06-29  2:44             ` Dave Chinner
2010-06-29  8:16           ` Joel Becker
2010-06-29  8:16             ` Joel Becker
2010-06-30  1:30             ` Joel Becker
2010-06-30  1:30               ` Joel Becker
2010-07-06 19:06         ` Joel Becker
2010-07-06 19:06           ` Joel Becker
2010-06-29  1:56     ` Dave Chinner
2010-06-29  1:56       ` [Ocfs2-devel] " Dave Chinner
2010-06-29  2:04       ` Joel Becker
2010-06-29  2:04         ` [Ocfs2-devel] " Joel Becker
2010-06-29  2:27         ` Dave Chinner
2010-06-29  2:27           ` [Ocfs2-devel] " Dave Chinner
2010-06-29  7:18           ` Joel Becker
2010-06-29  7:18             ` [Ocfs2-devel] " Joel Becker
2010-07-02 22:49             ` [PATCH] ocfs2: Zero the tail cluster when extending past i_size Joel Becker
2010-07-02 22:49               ` [Ocfs2-devel] " Joel Becker
2010-07-03 21:32               ` [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2 Joel Becker
2010-07-03 21:32                 ` [Ocfs2-devel] " Joel Becker
2010-07-03 21:33                 ` [PATCH 2/2] ocfs2: No need to zero pages past i_size. " Joel Becker
2010-07-03 21:33                   ` [Ocfs2-devel] " Joel Becker
2010-07-04 15:13                   ` Tao Ma
2010-07-04 15:13                     ` [Ocfs2-devel] " Tao Ma
2010-07-05  1:38                     ` Tao Ma
2010-07-05  1:38                       ` [Ocfs2-devel] " Tao Ma
2010-07-06  7:10                       ` Joel Becker
2010-07-06  7:10                         ` [Ocfs2-devel] " Joel Becker
2010-07-06  7:09                     ` Joel Becker
2010-07-06  7:09                       ` [Ocfs2-devel] " Joel Becker
2010-07-06 18:39                       ` Joel Becker
2010-07-06 18:39                         ` Joel Becker
2010-07-05  3:51                 ` Tao Ma [this message]
2010-07-05  3:51                   ` [Ocfs2-devel] [PATCH 1/2] ocfs2: Zero the tail cluster when extending past " Tao Ma
2010-07-06  7:17                   ` Joel Becker
2010-07-06  7:17                     ` [Ocfs2-devel] " Joel Becker
2010-07-06  7:54                     ` Tao Ma
2010-07-06  7:54                       ` [Ocfs2-devel] " Tao Ma
2010-07-06 11:58                       ` Joel Becker
2010-07-06 11:58                         ` [Ocfs2-devel] " Joel Becker
2010-07-07  0:42                         ` Tao Ma
2010-07-07  0:42                           ` [Ocfs2-devel] " Tao Ma
2010-07-07  2:03                           ` Joel Becker
2010-07-07  2:03                             ` [Ocfs2-devel] " Joel Becker
2010-07-06 18:48                   ` Joel Becker
2010-07-06 18:48                     ` [Ocfs2-devel] " Joel Becker
2010-07-06 18:57                   ` Joel Becker
2010-07-06 18:57                     ` [Ocfs2-devel] " Joel Becker
2010-07-07 11:16                 ` [PATCH 0/3] ocfs2: Tail zeroing fixes Joel Becker
2010-07-07 11:16                   ` [Ocfs2-devel] " Joel Becker
2010-07-12 22:45                   ` Joel Becker
2010-07-12 22:45                     ` Joel Becker
2010-07-07 11:16                 ` [PATCH 1/3] ocfs2: When zero extending, do it by page Joel Becker
2010-07-07 11:16                   ` [Ocfs2-devel] " Joel Becker
2010-07-07 15:19                   ` Tao Ma
2010-07-07 15:19                     ` [Ocfs2-devel] " Tao Ma
2010-07-07 20:04                     ` Joel Becker
2010-07-07 20:04                       ` [Ocfs2-devel] " Joel Becker
2010-07-08  3:44                   ` Tao Ma
2010-07-08  3:44                     ` [Ocfs2-devel] " Tao Ma
2010-07-08  9:51                     ` Joel Becker
2010-07-08  9:51                       ` [Ocfs2-devel] " Joel Becker
2010-07-07 11:16                 ` [PATCH 2/3] ocfs2: Zero the tail cluster when extending past i_size Joel Becker
2010-07-07 11:16                   ` [Ocfs2-devel] " Joel Becker
2010-07-07 11:16                 ` [PATCH 3/3] ocfs2: No need to zero pages " Joel Becker
2010-07-07 11:16                   ` [Ocfs2-devel] " Joel Becker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4C3156D0.5090001@oracle.com \
    --to=tao.ma@oracle.com \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=hch@lst.de \
    --cc=joel.becker@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mfasheh@suse.com \
    --cc=ocfs2-devel@oss.oracle.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.