Re: [PATCH v2] ocfs2: fix data corruption by fallocate

From: Junxiao Bi <junxiao.bi@oracle.com>
To: Joseph Qi <joseph.qi@linux.alibaba.com>, ocfs2-devel@oss.oracle.com
Cc: jack@suse.cz, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v2] ocfs2: fix data corruption by fallocate
Date: Tue, 25 May 2021 22:10:33 -0700
Message-ID: <297dcd1e-c741-007e-8f53-77eba9656e74@oracle.com>
In-Reply-To: <cbeec344-adb3-0dc4-51ee-4716c07176f3@linux.alibaba.com>

After moving it there, i_size_write() will be protected by ip_alloc_sem.
ocfs2_dio_end_io_write() updates i_size without holding the inode lock,
but it does hold ip_alloc_sem.

Thanks,

Junxiao.

On 5/25/21 7:11 PM, Joseph Qi wrote:
> Can we simply replace i_size_read() with 'orig_isize' and leave isize
> update along with other dirty inode operations?
> I think this fits more comfortably with the dirty inode transaction.
>
> Thanks,
> Joseph
>
> On 5/26/21 1:58 AM, Junxiao Bi wrote:
>> I would like to make the following change to the patch. Is that OK with you?
>>
>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>> index 17469fc7b20e..775657943057 100644
>> --- a/fs/ocfs2/file.c
>> +++ b/fs/ocfs2/file.c
>> @@ -1999,9 +1999,12 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>          }
>>
>>          /* zeroout eof blocks in the cluster. */
>> -       if (!ret && change_size && orig_isize < size)
>> +       if (!ret && change_size && orig_isize < size) {
>>                  ret = ocfs2_zeroout_partial_cluster(inode, orig_isize,
>>                                          size - orig_isize);
>> +               if (!ret)
>> +                       i_size_write(inode, size);
>> +       }
>>          up_write(&OCFS2_I(inode)->ip_alloc_sem);
>>          if (ret) {
>>                  mlog_errno(ret);
>> @@ -2018,9 +2021,6 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>                  goto out_inode_unlock;
>>          }
>>
>> -       if (change_size && i_size_read(inode) < size)
>> -               i_size_write(inode, size);
>> -
>>          inode->i_ctime = inode->i_mtime = current_time(inode);
>>          ret = ocfs2_mark_inode_dirty(handle, inode, di_bh);
>>          if (ret < 0)
>>
>> Thanks,
>>
>> Junxiao.
>>
>> On 5/24/21 7:04 PM, Joseph Qi wrote:
>>> Thanks for the explanations.
>>> One tiny cleanup: we can use 'orig_isize' instead of i_size_read() later
>>> in __ocfs2_change_file_space().
>>> Otherwise it looks good to me.
>>> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
>>>
>>> On 5/25/21 12:23 AM, Junxiao Bi wrote:
>>>> That will not work. The buffered write zeroes the range first and i_size
>>>> is updated only afterwards; in between, writeback could kick in and drop
>>>> those dirty buffers because they are beyond i_size. Besides that,
>>>> OCFS2_IOC_RESVSP64 never did the right job: it didn't take care of the
>>>> EOF blocks in the last cluster, so even a simple fallocate that extends
>>>> the file size could cause corruption. This patch fixes both issues.
>>>>
>>>> Thanks,
>>>>
>>>> Junxiao.
>>>>
>>>> On 5/23/21 4:52 AM, Joseph Qi wrote:
>>>>> Hi Junxiao,
>>>>> If change_size is true (!FALLOC_FL_KEEP_SIZE), it will update isize
>>>>> in __ocfs2_change_file_space(). Why do we have to zeroout first?
>>>>>
>>>>> Thanks,
>>>>> Joseph
>>>>>
>>>>> On 5/22/21 7:36 AM, Junxiao Bi wrote:
>>>>>> When fallocate punches holes beyond the inode size, if the original
>>>>>> isize is in the middle of the last cluster, then the part from isize to
>>>>>> the end of the cluster will be zeroed with a buffered write. At that time
>>>>>> isize is not yet updated to match the new size, so if writeback kicks in,
>>>>>> it will invoke ocfs2_writepage()->block_write_full_page(), where the
>>>>>> pages beyond the inode size will be dropped. That will cause file
>>>>>> corruption. Fix this by zeroing out EOF blocks when extending the inode
>>>>>> size.
>>>>>>
>>>>>> Running the following command with qemu-img 4.2.1 easily produces a
>>>>>> corrupted converted image file.
>>>>>>
>>>>>>        qemu-img convert -p -t none -T none -f qcow2 $qcow_image \
>>>>>>                 -O qcow2 -o compat=1.1 $qcow_image.conv
>>>>>>
>>>>>> qemu uses fallocate like this: it first punches holes beyond the inode
>>>>>> size, then extends the inode size.
>>>>>>
>>>>>>        fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0
>>>>>>        fallocate(11, 0, 2276196352, 65536) = 0
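For anyone who wants to issue the same pair of calls without qemu, here is a minimal userspace sketch. The helper name punch_then_extend() is hypothetical and not part of the patch; it only repeats the two fallocate(2) calls from the trace above and returns 0 or a negative errno.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>

/*
 * Hypothetical helper (not from the patch): punch a hole over
 * [offset, offset + len) without changing the file size, then
 * extend the file over the same range.
 * Returns 0 on success or a negative errno value.
 */
static int punch_then_extend(int fd, off_t offset, off_t len)
{
	/* First call from the trace: hole punch, size kept as is. */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
		      offset, len) < 0)
		return -errno;

	/* Second call from the trace: plain allocation, which may grow i_size. */
	if (fallocate(fd, 0, offset, len) < 0)
		return -errno;

	return 0;
}
```

Calling punch_then_extend(fd, 2276196352, 65536) on a file opened on an ocfs2 mount would issue the same sequence as the trace. Whether corruption shows up still depends on where the original isize falls inside the last cluster and on writeback timing.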
>>>>>>
>>>>>> v1: https://www.spinics.net/lists/linux-fsdevel/msg193999.html
>>>>>>
>>>>>> Cc: <stable@vger.kernel.org>
>>>>>> Cc: Jan Kara <jack@suse.cz>
>>>>>> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
>>>>>> ---
>>>>>>
>>>>>> Changes in v2:
>>>>>> - As suggested by Jan Kara, use sb_issue_zeroout() to zero EOF blocks on disk directly.
>>>>>>
>>>>>>     fs/ocfs2/file.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>>>>     1 file changed, 47 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>>>>> index f17c3d33fb18..17469fc7b20e 100644
>>>>>> --- a/fs/ocfs2/file.c
>>>>>> +++ b/fs/ocfs2/file.c
>>>>>> @@ -1855,6 +1855,45 @@ int ocfs2_remove_inode_range(struct inode *inode,
>>>>>>         return ret;
>>>>>>     }
>>>>>> +/*
>>>>>> + * zero out partial blocks of one cluster.
>>>>>> + *
>>>>>> + * start: file offset where zero starts, will be made upper block aligned.
>>>>>> + * len: it will be trimmed to the end of current cluster if "start + len"
>>>>>> + *      is bigger than it.
>>>>>> + */
>>>>>> +static int ocfs2_zeroout_partial_cluster(struct inode *inode,
>>>>>> +                    u64 start, u64 len)
>>>>>> +{
>>>>>> +    int ret;
>>>>>> +    u64 start_block, end_block, nr_blocks;
>>>>>> +    u64 p_block, offset;
>>>>>> +    u32 cluster, p_cluster, nr_clusters;
>>>>>> +    struct super_block *sb = inode->i_sb;
>>>>>> +    u64 end = ocfs2_align_bytes_to_clusters(sb, start);
>>>>>> +
>>>>>> +    if (start + len < end)
>>>>>> +        end = start + len;
>>>>>> +
>>>>>> +    start_block = ocfs2_blocks_for_bytes(sb, start);
>>>>>> +    end_block = ocfs2_blocks_for_bytes(sb, end);
>>>>>> +    nr_blocks = end_block - start_block;
>>>>>> +    if (!nr_blocks)
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    cluster = ocfs2_bytes_to_clusters(sb, start);
>>>>>> +    ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
>>>>>> +                &nr_clusters, NULL);
>>>>>> +    if (ret)
>>>>>> +        return ret;
>>>>>> +    if (!p_cluster)
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
>>>>>> +    p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
>>>>>> +    return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
>>>>>> +}
>>>>>> +
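To make the rounding in ocfs2_zeroout_partial_cluster() concrete, here is a userspace sketch of just the block-range arithmetic. The struct geom type and the helper names are assumptions for illustration; they stand in for the kernel helpers ocfs2_blocks_for_bytes() (rounds up to a block) and ocfs2_align_bytes_to_clusters() (rounds up to a cluster boundary), and nothing here touches a disk.

```c
#include <stdint.h>

/* Assumed filesystem geometry, e.g. 4 KiB blocks and 1 MiB clusters. */
struct geom {
	unsigned int block_bits;   /* log2(block size) */
	unsigned int cluster_bits; /* log2(cluster size) */
};

/* Userspace stand-in: number of blocks needed to hold 'bytes' (rounds up). */
static uint64_t blocks_for_bytes(const struct geom *g, uint64_t bytes)
{
	uint64_t bs = 1ULL << g->block_bits;

	return (bytes + bs - 1) >> g->block_bits;
}

/* Userspace stand-in: round 'bytes' up to the next cluster boundary. */
static uint64_t align_bytes_to_clusters(const struct geom *g, uint64_t bytes)
{
	uint64_t cs = 1ULL << g->cluster_bits;

	return (bytes + cs - 1) & ~(cs - 1);
}

/*
 * Same range computation as the patch: returns how many blocks would be
 * zeroed and stores the first logical block in *first_block.
 */
static uint64_t zeroout_nr_blocks(const struct geom *g, uint64_t start,
				  uint64_t len, uint64_t *first_block)
{
	uint64_t end = align_bytes_to_clusters(g, start);
	uint64_t start_block, end_block;

	/* Trim to the requested length if it ends inside this cluster. */
	if (start + len < end)
		end = start + len;

	start_block = blocks_for_bytes(g, start);
	end_block = blocks_for_bytes(g, end);
	if (first_block)
		*first_block = start_block;
	return end_block - start_block;
}
```

With 4 KiB blocks and 1 MiB clusters, start = 5000 and a large len give first_block = 2 and 254 blocks, that is, from the first block boundary after the old isize to the end of the cluster. A cluster-aligned start gives 0 blocks, so nothing is zeroed.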
>>>>>>     /*
>>>>>>      * Parts of this function taken from xfs_change_file_space()
>>>>>>      */
>>>>>> @@ -1865,7 +1904,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>>>>>     {
>>>>>>         int ret;
>>>>>>         s64 llen;
>>>>>> -    loff_t size;
>>>>>> +    loff_t size, orig_isize;
>>>>>>         struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>>>>>         struct buffer_head *di_bh = NULL;
>>>>>>         handle_t *handle;
>>>>>> @@ -1896,6 +1935,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>>>>>             goto out_inode_unlock;
>>>>>>         }
>>>>>> +    orig_isize = i_size_read(inode);
>>>>>>         switch (sr->l_whence) {
>>>>>>         case 0: /*SEEK_SET*/
>>>>>>             break;
>>>>>> @@ -1903,7 +1943,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>>>>>             sr->l_start += f_pos;
>>>>>>             break;
>>>>>>         case 2: /*SEEK_END*/
>>>>>> -        sr->l_start += i_size_read(inode);
>>>>>> +        sr->l_start += orig_isize;
>>>>>>             break;
>>>>>>         default:
>>>>>>             ret = -EINVAL;
>>>>>> @@ -1957,6 +1997,11 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
>>>>>>         default:
>>>>>>             ret = -EINVAL;
>>>>>>         }
>>>>>> +
>>>>>> +    /* zeroout eof blocks in the cluster. */
>>>>>> +    if (!ret && change_size && orig_isize < size)
>>>>>> +        ret = ocfs2_zeroout_partial_cluster(inode, orig_isize,
>>>>>> +                    size - orig_isize);
>>>>>>         up_write(&OCFS2_I(inode)->ip_alloc_sem);
>>>>>>         if (ret) {
>>>>>>             mlog_errno(ret);
>>>>>>