ocfs2-devel.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [Ocfs2-devel] [PATCH V2 1/2] ocfs2: fix zero out valid data
@ 2021-07-22  5:49 Junxiao Bi
  2021-07-22  5:49 ` [Ocfs2-devel] [PATCH V2 2/2] ocfs2: issue zeroout to EOF blocks Junxiao Bi
  0 siblings, 1 reply; 3+ messages in thread
From: Junxiao Bi @ 2021-07-22  5:49 UTC (permalink / raw)
  To: ocfs2-devel

If append-dio feature is enabled, direct-io write and fallocate could run
in parallel to extend file size, fallocate used "orig_isize" to record
i_size before taking "ip_alloc_sem", when ocfs2_zeroout_partial_cluster()
zeroout EOF blocks, i_size maybe already extended by ocfs2_dio_end_io_write(),
that will cause valid data zeroed out.

Fixes: 6bba4471f0cc ("ocfs2: fix data corruption by fallocate")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
---
Changes in V2:
- fix commit id in the patch header

 fs/ocfs2/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 775657943057..53bb46ce3cbb 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1935,7 +1935,6 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
 		goto out_inode_unlock;
 	}
 
-	orig_isize = i_size_read(inode);
 	switch (sr->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
@@ -1943,7 +1942,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
 		sr->l_start += f_pos;
 		break;
 	case 2: /*SEEK_END*/
-		sr->l_start += orig_isize;
+		sr->l_start += i_size_read(inode);
 		break;
 	default:
 		ret = -EINVAL;
@@ -1998,6 +1997,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
 		ret = -EINVAL;
 	}
 
+	orig_isize = i_size_read(inode);
 	/* zeroout eof blocks in the cluster. */
 	if (!ret && change_size && orig_isize < size) {
 		ret = ocfs2_zeroout_partial_cluster(inode, orig_isize,
-- 
2.24.3 (Apple Git-128)


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [Ocfs2-devel] [PATCH V2 2/2] ocfs2: issue zeroout to EOF blocks
  2021-07-22  5:49 [Ocfs2-devel] [PATCH V2 1/2] ocfs2: fix zero out valid data Junxiao Bi
@ 2021-07-22  5:49 ` Junxiao Bi
  2021-07-22  5:59   ` Joseph Qi
  0 siblings, 1 reply; 3+ messages in thread
From: Junxiao Bi @ 2021-07-22  5:49 UTC (permalink / raw)
  To: ocfs2-devel

For punch holes in EOF blocks, fallocate used buffer write to zero
the EOF blocks in last cluster. But since ->writepage will ignore
EOF pages, those zeros will not be flushed. This "looks" ok as
commit 6bba4471f0cc ("ocfs2: fix data corruption by fallocate")
will zero the EOF blocks when extend the file size, but it isn't.
The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had
DIRTY flag set, when writeback run by write_cache_pages(), DIRTY
flag on the page was cleared, but DIRTY flag on the buffer_head
not. When next write happened to those EOF pages, since buffer_head
already had DIRTY flag set, it would not mark page DIRTY again.
That made writeback ignore them forever. That will cause data
corruption. Even directio write can't work because it will fail
when trying to drop pages caches before direct io, as it found
the buffer_head for those pages still had DIRTY flag set, then
it will fall back to buffer io mode.
To make a summary of the issue, as writeback ingores EOF pages,
once any EOF page is generated, any write to it will only go
to the page cache, it will never be flushed to disk even file
size extends and that page is not EOF page any more.
The fix is to avoid zero EOF blocks with buffer write.

The following code snippet from qemu-img could trigger the corruption.

656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
...
660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
660   fallocate(11, 0, 2275868672, 327680) = 0
658   pwrite64(11, "\0\31\237\v\0\336\330\f\0\373~\r\0\300\270\16\0\335^\17\0\242\230\20\0\277>\21\0\204x\22"..., 311296, 2275868672) = 311296

Cc: <stable@vger.kernel.org>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
---

Changes in V2:
 - small code style change

 fs/ocfs2/file.c | 99 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 60 insertions(+), 39 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 53bb46ce3cbb..54d7843c0211 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1529,6 +1529,45 @@ static void ocfs2_truncate_cluster_pages(struct inode *inode, u64 byte_start,
 	}
 }
 
+/*
+ * zero out partial blocks of one cluster.
+ *
+ * start: file offset where zero starts, will be made upper block aligned.
+ * len: it will be trimmed to the end of current cluster if "start + len"
+ *      is bigger than it.
+ */
+static int ocfs2_zeroout_partial_cluster(struct inode *inode,
+					u64 start, u64 len)
+{
+	int ret;
+	u64 start_block, end_block, nr_blocks;
+	u64 p_block, offset;
+	u32 cluster, p_cluster, nr_clusters;
+	struct super_block *sb = inode->i_sb;
+	u64 end = ocfs2_align_bytes_to_clusters(sb, start);
+
+	if (start + len < end)
+		end = start + len;
+
+	start_block = ocfs2_blocks_for_bytes(sb, start);
+	end_block = ocfs2_blocks_for_bytes(sb, end);
+	nr_blocks = end_block - start_block;
+	if (!nr_blocks)
+		return 0;
+
+	cluster = ocfs2_bytes_to_clusters(sb, start);
+	ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
+				&nr_clusters, NULL);
+	if (ret)
+		return ret;
+	if (!p_cluster)
+		return 0;
+
+	offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
+	p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
+	return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
+}
+
 static int ocfs2_zero_partial_clusters(struct inode *inode,
 				       u64 start, u64 len)
 {
@@ -1538,6 +1577,7 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	unsigned int csize = osb->s_clustersize;
 	handle_t *handle;
+	loff_t isize = i_size_read(inode);
 
 	/*
 	 * The "start" and "end" values are NOT necessarily part of
@@ -1558,6 +1598,26 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
 	if ((start & (csize - 1)) == 0 && (end & (csize - 1)) == 0)
 		goto out;
 
+	/* No page cache for EOF blocks, issue zero out to disk. */
+	if (end > isize) {
+		/*
+		 * zeroout eof blocks in last cluster starting from
+		 * "isize" even "start" > "isize" because it is
+		 * complicated to zeroout just at "start" as "start"
+		 * may be not aligned with block size, buffer write
+		 * would be required to do that, but out of eof buffer
+		 * write is not supported.
+		 */
+		ret = ocfs2_zeroout_partial_cluster(inode, isize,
+					end - isize);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+		if (start >= isize)
+			goto out;
+		end = isize;
+	}
 	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
@@ -1855,45 +1915,6 @@ int ocfs2_remove_inode_range(struct inode *inode,
 	return ret;
 }
 
-/*
- * zero out partial blocks of one cluster.
- *
- * start: file offset where zero starts, will be made upper block aligned.
- * len: it will be trimmed to the end of current cluster if "start + len"
- *      is bigger than it.
- */
-static int ocfs2_zeroout_partial_cluster(struct inode *inode,
-					u64 start, u64 len)
-{
-	int ret;
-	u64 start_block, end_block, nr_blocks;
-	u64 p_block, offset;
-	u32 cluster, p_cluster, nr_clusters;
-	struct super_block *sb = inode->i_sb;
-	u64 end = ocfs2_align_bytes_to_clusters(sb, start);
-
-	if (start + len < end)
-		end = start + len;
-
-	start_block = ocfs2_blocks_for_bytes(sb, start);
-	end_block = ocfs2_blocks_for_bytes(sb, end);
-	nr_blocks = end_block - start_block;
-	if (!nr_blocks)
-		return 0;
-
-	cluster = ocfs2_bytes_to_clusters(sb, start);
-	ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
-				&nr_clusters, NULL);
-	if (ret)
-		return ret;
-	if (!p_cluster)
-		return 0;
-
-	offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
-	p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
-	return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
-}
-
 /*
  * Parts of this function taken from xfs_change_file_space()
  */
-- 
2.24.3 (Apple Git-128)


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [Ocfs2-devel] [PATCH V2 2/2] ocfs2: issue zeroout to EOF blocks
  2021-07-22  5:49 ` [Ocfs2-devel] [PATCH V2 2/2] ocfs2: issue zeroout to EOF blocks Junxiao Bi
@ 2021-07-22  5:59   ` Joseph Qi
  0 siblings, 0 replies; 3+ messages in thread
From: Joseph Qi @ 2021-07-22  5:59 UTC (permalink / raw)
  To: Junxiao Bi, ocfs2-devel



On 7/22/21 1:49 PM, Junxiao Bi wrote:
> For punch holes in EOF blocks, fallocate used buffer write to zero
> the EOF blocks in last cluster. But since ->writepage will ignore
> EOF pages, those zeros will not be flushed. This "looks" ok as
> commit 6bba4471f0cc ("ocfs2: fix data corruption by fallocate")
> will zero the EOF blocks when extend the file size, but it isn't.
> The problem happened on those EOF pages, before writeback, those
> pages had DIRTY flag set and all buffer_head in them also had
> DIRTY flag set, when writeback run by write_cache_pages(), DIRTY
> flag on the page was cleared, but DIRTY flag on the buffer_head
> not. When next write happened to those EOF pages, since buffer_head
> already had DIRTY flag set, it would not mark page DIRTY again.
> That made writeback ignore them forever. That will cause data
> corruption. Even directio write can't work because it will fail
> when trying to drop pages caches before direct io, as it found
> the buffer_head for those pages still had DIRTY flag set, then
> it will fall back to buffer io mode.
> To make a summary of the issue, as writeback ingores EOF pages,
> once any EOF page is generated, any write to it will only go
> to the page cache, it will never be flushed to disk even file
> size extends and that page is not EOF page any more.
> The fix is to avoid zero EOF blocks with buffer write.
> 
> The following code snippet from qemu-img could trigger the corruption.
> 
> 656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
> ...
> 660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
> 660   fallocate(11, 0, 2275868672, 327680) = 0
> 658   pwrite64(11, "\0\31\237\v\0\336\330\f\0\373~\r\0\300\270\16\0\335^\17\0\242\230\20\0\277>\21\0\204x\22"..., 311296, 2275868672) = 311296
> 
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> ---
> 
> Changes in V2:
>  - small code style change
> 
>  fs/ocfs2/file.c | 99 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 60 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 53bb46ce3cbb..54d7843c0211 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1529,6 +1529,45 @@ static void ocfs2_truncate_cluster_pages(struct inode *inode, u64 byte_start,
>  	}
>  }
>  
> +/*
> + * zero out partial blocks of one cluster.
> + *
> + * start: file offset where zero starts, will be made upper block aligned.
> + * len: it will be trimmed to the end of current cluster if "start + len"
> + *      is bigger than it.
> + */
> +static int ocfs2_zeroout_partial_cluster(struct inode *inode,
> +					u64 start, u64 len)
> +{
> +	int ret;
> +	u64 start_block, end_block, nr_blocks;
> +	u64 p_block, offset;
> +	u32 cluster, p_cluster, nr_clusters;
> +	struct super_block *sb = inode->i_sb;
> +	u64 end = ocfs2_align_bytes_to_clusters(sb, start);
> +
> +	if (start + len < end)
> +		end = start + len;
> +
> +	start_block = ocfs2_blocks_for_bytes(sb, start);
> +	end_block = ocfs2_blocks_for_bytes(sb, end);
> +	nr_blocks = end_block - start_block;
> +	if (!nr_blocks)
> +		return 0;
> +
> +	cluster = ocfs2_bytes_to_clusters(sb, start);
> +	ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
> +				&nr_clusters, NULL);
> +	if (ret)
> +		return ret;
> +	if (!p_cluster)
> +		return 0;
> +
> +	offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
> +	p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
> +	return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
> +}
> +
>  static int ocfs2_zero_partial_clusters(struct inode *inode,
>  				       u64 start, u64 len)
>  {
> @@ -1538,6 +1577,7 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
>  	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>  	unsigned int csize = osb->s_clustersize;
>  	handle_t *handle;
> +	loff_t isize = i_size_read(inode);
>  
>  	/*
>  	 * The "start" and "end" values are NOT necessarily part of
> @@ -1558,6 +1598,26 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
>  	if ((start & (csize - 1)) == 0 && (end & (csize - 1)) == 0)
>  		goto out;
>  
> +	/* No page cache for EOF blocks, issue zero out to disk. */
> +	if (end > isize) {
> +		/*
> +		 * zeroout eof blocks in last cluster starting from
> +		 * "isize" even "start" > "isize" because it is
> +		 * complicated to zeroout just at "start" as "start"
> +		 * may be not aligned with block size, buffer write
> +		 * would be required to do that, but out of eof buffer
> +		 * write is not supported.
> +		 */
> +		ret = ocfs2_zeroout_partial_cluster(inode, isize,
> +					end - isize);
> +		if (ret) {
> +			mlog_errno(ret);
> +			goto out;
> +		}
> +		if (start >= isize)
> +			goto out;
> +		end = isize;
> +	}
>  	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
>  	if (IS_ERR(handle)) {
>  		ret = PTR_ERR(handle);
> @@ -1855,45 +1915,6 @@ int ocfs2_remove_inode_range(struct inode *inode,
>  	return ret;
>  }
>  
> -/*
> - * zero out partial blocks of one cluster.
> - *
> - * start: file offset where zero starts, will be made upper block aligned.
> - * len: it will be trimmed to the end of current cluster if "start + len"
> - *      is bigger than it.
> - */
> -static int ocfs2_zeroout_partial_cluster(struct inode *inode,
> -					u64 start, u64 len)
> -{
> -	int ret;
> -	u64 start_block, end_block, nr_blocks;
> -	u64 p_block, offset;
> -	u32 cluster, p_cluster, nr_clusters;
> -	struct super_block *sb = inode->i_sb;
> -	u64 end = ocfs2_align_bytes_to_clusters(sb, start);
> -
> -	if (start + len < end)
> -		end = start + len;
> -
> -	start_block = ocfs2_blocks_for_bytes(sb, start);
> -	end_block = ocfs2_blocks_for_bytes(sb, end);
> -	nr_blocks = end_block - start_block;
> -	if (!nr_blocks)
> -		return 0;
> -
> -	cluster = ocfs2_bytes_to_clusters(sb, start);
> -	ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
> -				&nr_clusters, NULL);
> -	if (ret)
> -		return ret;
> -	if (!p_cluster)
> -		return 0;
> -
> -	offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
> -	p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
> -	return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
> -}
> -
>  /*
>   * Parts of this function taken from xfs_change_file_space()
>   */
> 

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-07-22  5:59 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22  5:49 [Ocfs2-devel] [PATCH V2 1/2] ocfs2: fix zero out valid data Junxiao Bi
2021-07-22  5:49 ` [Ocfs2-devel] [PATCH V2 2/2] ocfs2: issue zeroout to EOF blocks Junxiao Bi
2021-07-22  5:59   ` Joseph Qi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).