Re: [Ocfs2-devel] [PATCH v2] ocfs2: fix data corruption by fallocate

From: Joseph Qi <joseph.qi@linux.alibaba.com>
To: Junxiao Bi <junxiao.bi@oracle.com>, ocfs2-devel@oss.oracle.com
Cc: linux-fsdevel@vger.kernel.org, jack@suse.cz
Subject: Re: [Ocfs2-devel] [PATCH v2] ocfs2: fix data corruption by fallocate
Date: Tue, 25 May 2021 10:04:44 +0800	[thread overview]
Message-ID: <21d8b289-541d-50f5-6f86-de3ee69c56c5@linux.alibaba.com> (raw)
In-Reply-To: <8aa90f5d-e4db-5107-1d3c-383294871196@oracle.com>

Thanks for the explanations.
A tiny cleanup, we can use 'orig_isize' instead of i_size_read() later
in __ocfs2_change_file_space().
Other looks good to me.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

On 5/25/21 12:23 AM, Junxiao Bi wrote:
> That will not work, buffer write zero first, then update i_size, in between writeback could be kicked in and clear those dirty buffers because they were out of i_size. Beside that, OCFS2_IOC_RESVSP64 was never doing right job, it didn't take care eof blocks in the last cluster, that made even a simple fallocate to extend file size could cause corruption. This patch fixed both issues.
> 
> Thanks,
> 
> Junxiao.
> 
> On 5/23/21 4:52 AM, Joseph Qi wrote:
>> Hi Junxiao,
>> If change_size is true (!FALLOC_FL_KEEP_SIZE), it will update isize
>> in __ocfs2_change_file_space(). Why do we have to zeroout first?
>>
>> Thanks,
>> Joseph
>>
>> On 5/22/21 7:36 AM, Junxiao Bi wrote:
>>> When fallocate punches holes out of inode size, if original isize is in
>>> the middle of last cluster, then the part from isize to the end of the
>>> cluster will be zeroed with buffer write, at that time isize is not
>>> yet updated to match the new size, if writeback is kicked in, it will
>>> invoke ocfs2_writepage()->block_write_full_page() where the pages out
>>> of inode size will be dropped. That will cause file corruption. Fix
>>> this by zero out eof blocks when extending the inode size.
>>>
>>> Running the following command with qemu-image 4.2.1 can get a corrupted
>>> coverted image file easily.
>>>
>>>      qemu-img convert -p -t none -T none -f qcow2 $qcow_image \
>>>               -O qcow2 -o compat=1.1 $qcow_image.conv
>>>
>>> The usage of fallocate in qemu is like this, it first punches holes out of
>>> inode size, then extend the inode size.
>>>
>>>      fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0
>>>      fallocate(11, 0, 2276196352, 65536) = 0
>>>
>>> v1: https://www.spinics.net/lists/linux-fsdevel/msg193999.html
>>>
>>> Cc: <stable@vger.kernel.org>
>>> Cc: Jan Kara <jack@suse.cz>
>>> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
>>> ---
>>>
>>> Changes in v2:
>>> - suggested by Jan Kara, using sb_issue_zeroout to zero eof blocks in disk directly.
>>>
>>>   fs/ocfs2/file.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>   1 file changed, 47 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>> index f17c3d33fb18..17469fc7b20e 100644
>>> --- a/fs/ocfs2/file.c
>>> +++ b/fs/ocfs2/file.c
>>> @@ -1855,6 +1855,45 @@ int ocfs2_remove_inode_range(struct inode *inode,
>>>       return ret;
>>>   }
>>>   +/*
>>> + * zero out partial blocks of one cluster.
>>> + *
>>> + * start: file offset where zero starts, will be made upper block aligned.
>>> + * len: it will be trimmed to the end of current cluster if "start + len"
>>> + *      is bigger than it.
>>> + */
>>> +static int ocfs2_zeroout_partial_cluster(struct inode *inode,
>>> +                    u64 start, u64 len)
>>> +{
>>> +    int ret;
>>> +    u64 start_block, end_block, nr_blocks;
>>> +    u64 p_block, offset;
>>> +    u32 cluster, p_cluster, nr_clusters;
>>> +    struct super_block *sb = inode->i_sb;
>>> +    u64 end = ocfs2_align_bytes_to_clusters(sb, start);
>>> +
>>> +    if (start + len < end)
>>> +        end = start + len;
>>> +
>>> +    start_block = ocfs2_blocks_for_bytes(sb, start);
>>> +    end_block = ocfs2_blocks_for_bytes(sb, end);
>>> +    nr_blocks = end_block - start_block;
>>> +    if (!nr_blocks)
>>> +        return 0;
>>> +
>>> +    cluster = ocfs2_bytes_to_clusters(sb, start);
>>> +    ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
>>> +                &nr_clusters, NULL);
>>> +    if (ret)
>>> +        return ret;
>>> +    if (!p_cluster)
>>> +        return 0;
>>> +
>>> +    offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
>>> +    p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
>>> +    return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
>>> +}
>>> +
>>>   /*
>>>    * Parts of this function taken from xfs_change_file_space()
>>>    */
>>> @@ -1865,7 +1904,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>>   {
>>>       int ret;
>>>       s64 llen;
>>> -    loff_t size;
>>> +    loff_t size, orig_isize;
>>>       struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>>       struct buffer_head *di_bh = NULL;
>>>       handle_t *handle;
>>> @@ -1896,6 +1935,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>>           goto out_inode_unlock;
>>>       }
>>>   +    orig_isize = i_size_read(inode);
>>>       switch (sr->l_whence) {
>>>       case 0: /*SEEK_SET*/
>>>           break;
>>> @@ -1903,7 +1943,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>>           sr->l_start += f_pos;
>>>           break;
>>>       case 2: /*SEEK_END*/
>>> -        sr->l_start += i_size_read(inode);
>>> +        sr->l_start += orig_isize;
>>>           break;
>>>       default:
>>>           ret = -EINVAL;
>>> @@ -1957,6 +1997,11 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>>       default:
>>>           ret = -EINVAL;
>>>       }
>>> +
>>> +    /* zeroout eof blocks in the cluster. */
>>> +    if (!ret && change_size && orig_isize < size)
>>> +        ret = ocfs2_zeroout_partial_cluster(inode, orig_isize,
>>> +                    size - orig_isize);
>>>       up_write(&OCFS2_I(inode)->ip_alloc_sem);
>>>       if (ret) {
>>>           mlog_errno(ret);
>>>

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel