* [f2fs-dev]  Potential data corruption?
@ 2019-12-07 10:10 红烧的威化饼
  2019-12-08  4:00 ` Chao Yu
  0 siblings, 1 reply; 6+ messages in thread
From: 红烧的威化饼 @ 2019-12-07 10:10 UTC (permalink / raw)
  To: linux-f2fs-devel

Hi F2FS experts,
The following confuses me:

A typical fsync() goes like this:
1) Issue data block IOs
2) Wait for completion
3) Issue chained node block IOs
4) Wait for completion
5) Issue flush command

In order to preserve data consistency under sudden power failure, it requires that the storage device persists data blocks prior to node blocks.
Otherwise, under sudden power failure, it's possible that the persisted node block points to NULL data blocks.

However, according to this study (https://www.usenix.org/conference/fast18/presentation/won), the order in which requests are persisted does not necessarily equal the order in which they complete (because of volatile device caches). This means that it's possible for node blocks to be persisted before data blocks.

Does F2FS have other mechanisms to prevent such inconsistency? Or does it require the device to persist data without reordering?
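
To make the concern concrete, here is a small toy model in C of a device with a volatile write cache. It is not F2FS code: every name in it (`cached_write`, `dev_write`, `dev_drain_one`, `dev_flush`, and the two scenario functions) is made up for illustration. It assumes a write completes as soon as it reaches the cache, and that the device may drain cached writes in any order unless a flush is issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not F2FS code: one entry per write sitting in the device. */
enum blk_kind { BLK_DATA, BLK_NODE };

struct cached_write {
	enum blk_kind kind;
	bool completed;  /* device acknowledged the write to the host */
	bool persisted;  /* contents actually reached stable media */
};

/* A write completes once it is in the volatile cache; it is not yet durable. */
static void dev_write(struct cached_write *w, enum blk_kind kind)
{
	w->kind = kind;
	w->completed = true;
	w->persisted = false;
}

/* The device is free to drain cached writes in any order it likes. */
static void dev_drain_one(struct cached_write *w)
{
	assert(w->completed);
	w->persisted = true;
}

/* A flush command makes every completed write durable. */
static void dev_flush(struct cached_write *w, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (w[i].completed)
			w[i].persisted = true;
}

/* Crash check: is some node block durable while some data block is not? */
static bool node_without_data(const struct cached_write *w, size_t n)
{
	bool node_durable = false, data_missing = false;

	for (size_t i = 0; i < n; i++) {
		if (!w[i].completed)
			continue;
		if (w[i].kind == BLK_NODE && w[i].persisted)
			node_durable = true;
		if (w[i].kind == BLK_DATA && !w[i].persisted)
			data_missing = true;
	}
	return node_durable && data_missing;
}

/* Steps 1-4 of the sequence, then a crash before the flush in step 5. */
bool crash_before_flush(void)
{
	struct cached_write w[2] = { 0 };

	dev_write(&w[0], BLK_DATA);   /* steps 1-2: data IO completes */
	dev_write(&w[1], BLK_NODE);   /* steps 3-4: node IO completes */
	dev_drain_one(&w[1]);         /* device happens to persist the node first */
	return node_without_data(w, 2);
}

/* Same sequence, with a flush between the data and node writes. */
bool crash_with_flush_between(void)
{
	struct cached_write w[2] = { 0 };

	dev_write(&w[0], BLK_DATA);
	dev_flush(w, 2);              /* data is durable before the node is issued */
	dev_write(&w[1], BLK_NODE);
	dev_drain_one(&w[1]);
	return node_without_data(w, 2);
}
```

In this model `crash_before_flush()` reports the problematic state (a durable node whose data is missing), while `crash_with_flush_between()` does not, which is the ordering the question is asking about.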

Thanks!

Hongwei
_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [f2fs-dev] Potential data corruption?
  2019-12-07 10:10 [f2fs-dev] Potential data corruption? 红烧的威化饼
@ 2019-12-08  4:00 ` Chao Yu
  2019-12-08 13:15   ` Hongwei Qin
  0 siblings, 1 reply; 6+ messages in thread
From: Chao Yu @ 2019-12-08  4:00 UTC (permalink / raw)
  To: 红烧的威化饼, linux-f2fs-devel

Hello,

On 2019-12-7 18:10, 红烧的威化饼 wrote:
> Hi F2FS experts,
> The following confuses me:
>
> A typical fsync() goes like this:
> 1) Issue data block IOs
> 2) Wait for completion
> 3) Issue chained node block IOs
> 4) Wait for completion
> 5) Issue flush command
>
> In order to preserve data consistency under sudden power failure, it requires that the storage device persists data blocks prior to node blocks.
> Otherwise, under sudden power failure, it's possible that the persisted node block points to NULL data blocks.

First, this doesn't break POSIX semantics, right? Since fsync() didn't return
successfully before the sudden power cut, we cannot guarantee that the data is
fully persisted in that case.

However, what you want looks like atomic write semantics, which is mostly what
databases want to guarantee when updating a db file.

F2FS supports atomic writes via ioctl, which SQLite officially uses; I suggest
you check the details of that implementation.
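
For readers who have not used the interface, a rough user-space sketch of an atomic update might look like the following. The helper name `f2fs_atomic_pwrite` is hypothetical, and the ioctl numbers are written out as I understand them from the kernel's F2FS headers, so please treat them as an assumption and check them against your kernel. This is a sketch of the calling pattern, not a description of how SQLite does it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * ioctl numbers as I understand them from the kernel's F2FS headers
 * (assumption: verify against your kernel before relying on them).
 */
#ifndef F2FS_IOCTL_MAGIC
#define F2FS_IOCTL_MAGIC 0xf5
#endif
#ifndef F2FS_IOC_START_ATOMIC_WRITE
#define F2FS_IOC_START_ATOMIC_WRITE  _IO(F2FS_IOCTL_MAGIC, 1)
#endif
#ifndef F2FS_IOC_COMMIT_ATOMIC_WRITE
#define F2FS_IOC_COMMIT_ATOMIC_WRITE _IO(F2FS_IOCTL_MAGIC, 2)
#endif

/*
 * Hypothetical helper: write len bytes at offset off as one atomic update.
 * Returns 0 on success, -1 on any failure (errno is left as set by the
 * failing call).
 */
int f2fs_atomic_pwrite(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	assert(buf != NULL || len == 0);

	/* Fails if fd is not an F2FS file or atomic writes are unavailable. */
	if (ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE) < 0)
		return -1;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;   /* interrupted, retry the same range */
			return -1;
		}
		if (n == 0)
			return -1;      /* no progress, give up */
		p += n;
		off += n;
		len -= (size_t)n;
	}

	/* The commit is the point at which the whole update becomes visible. */
	return ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE) < 0 ? -1 : 0;
}
```

The file is expected to live on an F2FS mount and be opened for writing; on other filesystems the first ioctl is expected to fail, in which case the helper returns -1 without writing anything.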

Thanks,


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [f2fs-dev] Potential data corruption?
  2019-12-08  4:00 ` Chao Yu
@ 2019-12-08 13:15   ` Hongwei Qin
  2019-12-08 13:41     ` Chao Yu
  2019-12-08 13:51     ` Gao Xiang via Linux-f2fs-devel
  0 siblings, 2 replies; 6+ messages in thread
From: Hongwei Qin @ 2019-12-08 13:15 UTC (permalink / raw)
  To: Chao Yu; +Cc: linux-f2fs-devel

Hi,

On Sun, Dec 8, 2019 at 12:01 PM Chao Yu <chao@kernel.org> wrote:
>
> Hello,
>
> First, this doesn't break POSIX semantics, right? Since fsync() didn't return
> successfully before the sudden power cut, we cannot guarantee that the data is
> fully persisted in that case.
>
> However, what you want looks like atomic write semantics, which is mostly what
> databases want to guarantee when updating a db file.
>
> F2FS supports atomic writes via ioctl, which SQLite officially uses; I suggest
> you check the details of that implementation.
>
> Thanks,
>

Thanks for your kind reply.
It's true that if a power failure hits before fsync() completes,
POSIX doesn't require the FS to recover the file. However, consider the
following situation:

1) Data block IOs (Not persisted)
2) Node block IOs (All Persisted)
3) Power failure

Since the node blocks are all persisted before the power failure, the node
chain isn't broken. Note that this file's new data was not properly
persisted before the crash. So the recovery process should be able to
recognize this situation and avoid recovering this file. However, since
the node chain is not broken, perhaps the recovery process will regard
this file as recoverable?

Thanks!


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [f2fs-dev] Potential data corruption?
  2019-12-08 13:15   ` Hongwei Qin
@ 2019-12-08 13:41     ` Chao Yu
  2019-12-08 13:51     ` Gao Xiang via Linux-f2fs-devel
  1 sibling, 0 replies; 6+ messages in thread
From: Chao Yu @ 2019-12-08 13:41 UTC (permalink / raw)
  To: Hongwei Qin; +Cc: linux-f2fs-devel

Hi,

On 2019-12-8 21:15, Hongwei Qin wrote:
> Hi,
>
> Thanks for your kind reply.
> It's true that if a power failure hits before fsync() completes,
> POSIX doesn't require the FS to recover the file. However, consider the
> following situation:
>
> 1) Data block IOs (Not persisted)
> 2) Node block IOs (All Persisted)
> 3) Power failure
>
> Since the node blocks are all persisted before the power failure, the node
> chain isn't broken. Note that this file's new data was not properly
> persisted before the crash. So the recovery process should be able to
> recognize this situation and avoid recovering this file. However, since
> the node chain is not broken, perhaps the recovery process will regard
> this file as recoverable?

This is why atomic write submission tags the last node bio with PREFLUSH & FUA,
so that all data IO is persisted before the node IO. The recovery flag is set
only in the last node block of the node chain; if the last node block is not
persisted, none of the atomic write data will be recovered. With this
mechanism, we can guarantee atomic write semantics.

__write_node_page()
{
...
	if (atomic && !test_opt(sbi, NOBARRIER))
		fio.op_flags |= REQ_PREFLUSH | REQ_FUA;
...
}

f2fs_fsync_node_page()
{
...
			if (!atomic || page == last_page) {
				set_fsync_mark(page, 1);
				if (IS_INODE(page)) {
					if (is_inode_flag_set(inode,
								FI_DIRTY_INODE))
						f2fs_update_inode(inode, page);
					set_dentry_mark(page,
						f2fs_need_dentry_mark(sbi, ino));
				}
				/*  may be written by other thread */
				if (!PageDirty(page))
					set_page_dirty(page);
			}
...
}
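
The effect of the two excerpts above can be pictured with a small toy model in C. It is not the F2FS on-disk format or the real recovery code: `struct blk`, `write_last_node`, `chain_recoverable`, and the scenario functions are names invented here, and the model assumes that recovery walks the node chain in order and stops at the first node that never reached media.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not the F2FS on-disk format. */
struct blk {
	bool persisted;   /* reached stable media before the crash */
	bool fsync_mark;  /* recovery flag, cf. set_fsync_mark() above */
};

/*
 * Model of writing the last node with REQ_PREFLUSH | REQ_FUA:
 * PREFLUSH drains everything that completed earlier, FUA makes the
 * marked node itself durable before the write is reported complete.
 */
static void write_last_node(struct blk *data, size_t nd,
			    struct blk *nodes, size_t nn)
{
	assert(nn > 0);
	for (size_t i = 0; i < nd; i++)
		data[i].persisted = true;
	for (size_t i = 0; i + 1 < nn; i++)
		nodes[i].persisted = true;
	nodes[nn - 1].fsync_mark = true;   /* atomic case: only the last node */
	nodes[nn - 1].persisted = true;
}

/* Recovery walks the chain in order and stops at the first missing node. */
static bool chain_recoverable(const struct blk *nodes, size_t nn)
{
	for (size_t i = 0; i < nn && nodes[i].persisted; i++)
		if (nodes[i].fsync_mark)
			return true;
	return false;
}

/* Crash after the earlier nodes hit media but before the last node is written. */
bool recovered_without_last_node(void)
{
	struct blk nodes[3] = {
		{ true, false }, { true, false }, { false, false }
	};

	return chain_recoverable(nodes, 3);
}

/* Last node is durable: is the chain recoverable and is all data durable? */
bool recovered_with_all_data(void)
{
	struct blk data[2] = { { false, false }, { false, false } };
	struct blk nodes[3] = {
		{ false, false }, { false, false }, { false, false }
	};

	write_last_node(data, 2, nodes, 3);
	return chain_recoverable(nodes, 3) &&
	       data[0].persisted && data[1].persisted;
}
```

In this model nothing is recovered unless the marked last node is on media, and whenever it is on media the data it depends on is there as well.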


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [f2fs-dev] Potential data corruption?
  2019-12-08 13:15   ` Hongwei Qin
  2019-12-08 13:41     ` Chao Yu
@ 2019-12-08 13:51     ` Gao Xiang via Linux-f2fs-devel
  2019-12-09 10:46       ` Chao Yu
  1 sibling, 1 reply; 6+ messages in thread
From: Gao Xiang via Linux-f2fs-devel @ 2019-12-08 13:51 UTC (permalink / raw)
  To: Hongwei Qin; +Cc: linux-f2fs-devel

Hi,

On Sun, Dec 08, 2019 at 09:15:55PM +0800, Hongwei Qin wrote:
> Hi,
> 
> Thanks for your kind reply.
> It's true that if a power failure hits before fsync() completes,
> POSIX doesn't require the FS to recover the file. However, consider the
> following situation:
> 
> 1) Data block IOs (Not persisted)
> 2) Node block IOs (All Persisted)
> 3) Power failure
> 
> Since the node blocks are all persisted before the power failure, the node
> chain isn't broken. Note that this file's new data was not properly
> persisted before the crash. So the recovery process should be able to
> recognize this situation and avoid recovering this file. However, since
> the node chain is not broken, perhaps the recovery process will regard
> this file as recoverable?

As far as my limited understanding goes, I'm afraid this seems true in the
extreme case. Without a proper FLUSH command, newer nodes could be recovered
even though no newer data was persisted.

So if fsync() is not successful, the old data should be read back,
but in this case unexpected data (neither A nor A'; it could be random data
C) will be considered valid since its node is OK.

It seems F2FS should either FLUSH data before the related node chain is
written or introduce some kind of data checksum.
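
To illustrate the checksum idea only, here is a rough sketch in C. Nothing in it exists in F2FS as far as I know: `struct node_entry`, its `data_csum` field, and `data_matches_node` are hypothetical, and the Adler-32 style checksum is just a stand-in for whatever checksum would really be chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK_SIZE 4096

/* Adler-32 style checksum, used here only as a stand-in. */
uint32_t blk_checksum(const uint8_t *blk, size_t len)
{
	uint32_t a = 1, b = 0;

	assert(blk != NULL);
	for (size_t i = 0; i < len; i++) {
		a = (a + blk[i]) % 65521u;
		b = (b + a) % 65521u;
	}
	return (b << 16) | a;
}

/* Hypothetical node entry: block address plus checksum of the new contents. */
struct node_entry {
	uint32_t blkaddr;
	uint32_t data_csum;
};

/* At recovery, accept the block only if media matches what the node recorded. */
bool data_matches_node(const struct node_entry *ne,
		       const uint8_t *on_media, size_t len)
{
	return blk_checksum(on_media, len) == ne->data_csum;
}

/* Node for A' is durable, but the block still holds unrelated contents C. */
bool stale_block_rejected(void)
{
	static uint8_t new_data[BLK_SIZE], on_media[BLK_SIZE];

	memset(new_data, 0xA1, sizeof(new_data));  /* A': what fsync() wrote */
	memset(on_media, 0xC3, sizeof(on_media));  /* C: what is really there */

	struct node_entry ne = {
		.blkaddr = 100,
		.data_csum = blk_checksum(new_data, BLK_SIZE),
	};
	return !data_matches_node(&ne, on_media, BLK_SIZE);
}

/* Node and data are both durable: the block is accepted. */
bool persisted_block_accepted(void)
{
	static uint8_t new_data[BLK_SIZE];

	memset(new_data, 0xA1, sizeof(new_data));

	struct node_entry ne = {
		.blkaddr = 100,
		.data_csum = blk_checksum(new_data, BLK_SIZE),
	};
	return data_matches_node(&ne, new_data, BLK_SIZE);
}
```

With such a check, recovery could skip a node whose data block does not match, at the cost of computing and storing a checksum per block.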

If I am wrong, kindly correct me...

Thanks,
Gao Xiang




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [f2fs-dev] Potential data corruption?
  2019-12-08 13:51     ` Gao Xiang via Linux-f2fs-devel
@ 2019-12-09 10:46       ` Chao Yu
  0 siblings, 0 replies; 6+ messages in thread
From: Chao Yu @ 2019-12-09 10:46 UTC (permalink / raw)
  To: Gao Xiang, Hongwei Qin; +Cc: linux-f2fs-devel

On 2019/12/8 21:51, Gao Xiang via Linux-f2fs-devel wrote:
> Hi,
> 
> As far as my limited understanding goes, I'm afraid this seems true in the
> extreme case. Without a proper FLUSH command, newer nodes could be recovered
> even though no newer data was persisted.
> 
> So if fsync() is not successful, the old data should be read back,
> but in this case unexpected data (neither A nor A'; it could be random data
> C) will be considered valid since its node is OK.
> 
> It seems F2FS should either FLUSH data before the related node chain is
> written or introduce some kind of data checksum.
> 
> If I am wrong, kindly correct me...

Yes. I guess if users want a stronger consistency guarantee from fsync() than the
POSIX one, we can refactor fsync_mode=strict a bit to handle fsync() IOs the way
we do atomic write IOs, keeping a strict data/node IO order. But note that such
a consistency guarantee is weak: after a sudden power cut, the recovered file may
contain mixed old and new data (fsynced data only partially persisted), which may
also crash the apps.
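
The caveat about mixed old and new data can be pictured with one more toy model in C. It rests on simplifying assumptions that are mine: block i of the file is described by node i, every node of a non-atomic fsync() carries the fsync mark, and recovery stops at the first node that never reached media. The names `recovered_image` and `partial_chain_is_mixed` are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model: 'N' marks a block that shows new contents after recovery,
 * 'O' a block that still shows old contents. image must have room for
 * n characters plus the terminating NUL.
 */
void recovered_image(const bool *node_persisted, size_t n, char *image)
{
	size_t i = 0;

	assert(node_persisted != NULL && image != NULL);

	/* Each durable, marked node is replayed until the chain breaks. */
	for (; i < n && node_persisted[i]; i++)
		image[i] = 'N';
	/* Everything after the break keeps its pre-fsync contents. */
	for (; i < n; i++)
		image[i] = 'O';
	image[n] = '\0';
}

/* Two of four nodes reached media before the power cut. */
bool partial_chain_is_mixed(void)
{
	const bool persisted[4] = { true, true, false, false };
	char image[5];

	recovered_image(persisted, 4, image);
	return strcmp(image, "NNOO") == 0;
}
```

Even with data strictly ordered before nodes, the recovered file in this model is half new and half old, which is the state an application may not be prepared for.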

Thanks,


^ permalink raw reply	[flat|nested] 6+ messages in thread

Thread overview: 6+ messages
2019-12-07 10:10 [f2fs-dev] Potential data corruption? 红烧的威化饼
2019-12-08  4:00 ` Chao Yu
2019-12-08 13:15   ` Hongwei Qin
2019-12-08 13:41     ` Chao Yu
2019-12-08 13:51     ` Gao Xiang via Linux-f2fs-devel
2019-12-09 10:46       ` Chao Yu
