* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
       [not found] <415E76CC-A53D-4643-88AB-3D7D7DC56F98@dubeyko.com>
@ 2012-10-06 13:54 ` Vyacheslav Dubeyko
  2012-10-06 20:06   ` Jaegeuk Kim
  0 siblings, 1 reply; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-06 13:54 UTC (permalink / raw)
  To: jaegeuk.kim
  Cc: Al Viro, tytso, gregkh, linux-kernel, chur.lee, cm224.lee,
	jooyoung.hwang, linux-fsdevel

Hi Jaegeuk,

> From:	 	김재극 <jaegeuk.kim@samsung.com>
> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>, gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com, cm224.lee@samsung.com, jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> 
> This is a new patch set for the f2fs file system.
> 
> What is F2FS?
> =============
> 
> NAND flash memory-based storage devices, such as SSDs, eMMC, and SD cards,
> have been widely used in systems ranging from mobile devices to servers.
> Since they are known to have different characteristics from conventional
> rotational disks, a file system, as the upper layer to the storage device,
> should be adapted to these characteristics from scratch.
> 
> F2FS is a new file system carefully designed for NAND flash memory-based
> storage devices. We chose a log-structured file system approach, but we
> tried to adapt it to the new form of storage. We also remedy some known
> issues of the traditional log-structured file system, such as the snowball
> effect of the wandering tree and high cleaning overhead.
> 
> Because a NAND-based storage device shows different characteristics
> depending on its internal geometry or flash memory management scheme (aka
> FTL), we add various parameters not only for configuring the on-disk
> layout, but also for selecting allocation and cleaning algorithms.
> 

What about F2FS performance? Could you share benchmarking results for the new file system?

The case of an aged file system is especially interesting. How efficient is the GC implementation? Could you share benchmarking results for a heavily aged file system state?

With the best regards,
Vyacheslav Dubeyko.

> Patch set
> =========
> 
> The patch #1 adds a document to Documentation/filesystems/.
> The patch #2 adds a header file of on-disk layout to include/linux/.
> The patches #3-#15 add f2fs source files to fs/f2fs/.
> The last patch, patch #16, updates Makefile and Kconfig.
> 
> mkfs.f2fs
> =========
> 
> The file system formatting tool, "mkfs.f2fs", is available from the following
> download page:		
> http://sourceforge.net/projects/f2fs-tools/
> 
> 
> Usage
> =====
> 
> If you'd like to experience f2fs, simply:
> # mkfs.f2fs /dev/sdb1
> # mount -t f2fs /dev/sdb1 /mnt/f2fs
> 
> Short log
> =========
> 
> Jaegeuk Kim (16):
>  f2fs: add document
>  f2fs: add on-disk layout
>  f2fs: add superblock and major in-memory structure
>  f2fs: add super block operations
>  f2fs: add checkpoint operations
>  f2fs: add node operations
>  f2fs: add segment operations
>  f2fs: add file operations
>  f2fs: add address space operations for data
>  f2fs: add core inode operations
>  f2fs: add inode operations for special inodes
>  f2fs: add core directory operations
>  f2fs: add xattr and acl functionalities
>  f2fs: add garbage collection functions
>  f2fs: add recovery routines for roll-forward
>  f2fs: update Kconfig and Makefile
> 
> Documentation/filesystems/00-INDEX |    2 +
> Documentation/filesystems/f2fs.txt |  314 +++++++
> fs/Kconfig                         |    1 +
> fs/Makefile                        |    1 +
> fs/f2fs/Kconfig                    |   55 ++
> fs/f2fs/Makefile                   |    6 +
> fs/f2fs/acl.c                      |  402 ++++++++
> fs/f2fs/acl.h                      |   57 ++
> fs/f2fs/checkpoint.c               |  791 ++++++++++++++++
> fs/f2fs/data.c                     |  700 ++++++++++++++
> fs/f2fs/dir.c                      |  657 +++++++++++++
> fs/f2fs/f2fs.h                     |  981 ++++++++++++++++++++
> fs/f2fs/file.c                     |  643 +++++++++++++
> fs/f2fs/gc.c                       | 1140 +++++++++++++++++++++++
> fs/f2fs/gc.h                       |  203 +++++
> fs/f2fs/hash.c                     |   98 ++
> fs/f2fs/inode.c                    |  258 ++++++
> fs/f2fs/namei.c                    |  549 +++++++++++
> fs/f2fs/node.c                     | 1773 ++++++++++++++++++++++++++++++++++++
> fs/f2fs/node.h                     |  331 +++++++
> fs/f2fs/recovery.c                 |  372 ++++++++
> fs/f2fs/segment.c                  | 1755 +++++++++++++++++++++++++++++++++++
> fs/f2fs/segment.h                  |  627 +++++++++++++
> fs/f2fs/super.c                    |  550 +++++++++++
> fs/f2fs/xattr.c                    |  387 ++++++++
> fs/f2fs/xattr.h                    |  142 +++
> include/linux/f2fs_fs.h            |  359 ++++++++
> 27 files changed, 13154 insertions(+)
> create mode 100644 Documentation/filesystems/f2fs.txt
> create mode 100644 fs/f2fs/Kconfig
> create mode 100644 fs/f2fs/Makefile
> create mode 100644 fs/f2fs/acl.c
> create mode 100644 fs/f2fs/acl.h
> create mode 100644 fs/f2fs/checkpoint.c
> create mode 100644 fs/f2fs/data.c
> create mode 100644 fs/f2fs/dir.c
> create mode 100644 fs/f2fs/f2fs.h
> create mode 100644 fs/f2fs/file.c
> create mode 100644 fs/f2fs/gc.c
> create mode 100644 fs/f2fs/gc.h
> create mode 100644 fs/f2fs/hash.c
> create mode 100644 fs/f2fs/inode.c
> create mode 100644 fs/f2fs/namei.c
> create mode 100644 fs/f2fs/node.c
> create mode 100644 fs/f2fs/node.h
> create mode 100644 fs/f2fs/recovery.c
> create mode 100644 fs/f2fs/segment.c
> create mode 100644 fs/f2fs/segment.h
> create mode 100644 fs/f2fs/super.c
> create mode 100644 fs/f2fs/xattr.c
> create mode 100644 fs/f2fs/xattr.h
> create mode 100644 include/linux/f2fs_fs.h
> 
> -- 
> 1.7.9.5
> 
> 
> 
> 
> ---
> Jaegeuk Kim
> Samsung
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  
> http://vger.kernel.org/majordomo-info.html
> 
> Please read the FAQ at http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-06 13:54 ` [PATCH 00/16] f2fs: introduce flash-friendly file system Vyacheslav Dubeyko
@ 2012-10-06 20:06   ` Jaegeuk Kim
  2012-10-07  7:09     ` Marco Stornelli
  2012-10-07 10:15       ` Vyacheslav Dubeyko
  0 siblings, 2 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-06 20:06 UTC (permalink / raw)
  To: Vyacheslav Dubeyko
  Cc: jaegeuk.kim, Al Viro, tytso, gregkh, linux-kernel, chur.lee,
	cm224.lee, jooyoung.hwang, linux-fsdevel

2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> Hi Jaegeuk,

Hi.
We know each other, right? :)

> 
> > From:	 	김재극 <jaegeuk.kim@samsung.com>
> > To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>, gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com, cm224.lee@samsung.com, jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> > Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> > Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> > 
> > This is a new patch set for the f2fs file system.
> > 
> > What is F2FS?
> > =============
> > 
> > NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> > been widely being used for ranging from mobile to server systems. Since they are
> > known to have different characteristics from the conventional rotational disks,
> > a file system, an upper layer to the storage device, should adapt to the changes
> > from the sketch.
> > 
> > F2FS is a new file system carefully designed for the NAND flash memory-based storage
> > devices. We chose a log structure file system approach, but we tried to adapt it
> > to the new form of storage. Also we remedy some known issues of the very old log
> > structured file system, such as snowball effect of wandering tree and high cleaning
> > overhead.
> > 
> > Because a NAND-based storage device shows different characteristics according to
> > its internal geometry or flash memory management scheme aka FTL, we add various
> > parameters not only for configuring on-disk layout, but also for selecting allocation
> > and cleaning algorithms.
> > 
> 
> What about F2FS performance? Could you share benchmarking results of the new file system?
> 
> It is very interesting the case of aged file system. How is GC's implementation efficient? Could you share benchmarking results for the very aged file system state?
> 

Although I have benchmark results, currently I'd like to see results
measured by the community as a black box. As you know, the results are
highly dependent on the workloads and parameters, so I think it would be
better to wait and see other results for a while.
Thanks,

> With the best regards,
> Vyacheslav Dubeyko.
> 
> > Patch set
> > =========
> > 
> > The patch #1 adds a document to Documentation/filesystems/.
> > The patch #2 adds a header file of on-disk layout to include/linux/.
> > The patches #3-#15 adds f2fs source files to fs/f2fs/.
> > The Last patch, patch #16, updates Makefile and Kconfig.
> > 
> > mkfs.f2fs
> > =========
> > 
> > The file system formatting tool, "mkfs.f2fs", is available from the following
> > download page:		
> > http://sourceforge.net/projects/f2fs-tools/
> > 
> > 
> > Usage
> > =====
> > 
> > If you'd like to experience f2fs, simply:
> > # mkfs.f2fs /dev/sdb1
> > # mount -t f2fs /dev/sdb1 /mnt/f2fs
> > 
> > Short log
> > =========
> > 
> > Jaegeuk Kim (16):
> >  f2fs: add document
> >  f2fs: add on-disk layout
> >  f2fs: add superblock and major in-memory structure
> >  f2fs: add super block operations
> >  f2fs: add checkpoint operations
> >  f2fs: add node operations
> >  f2fs: add segment operations
> >  f2fs: add file operations
> >  f2fs: add address space operations for data
> >  f2fs: add core inode operations
> >  f2fs: add inode operations for special inodes
> >  f2fs: add core directory operations
> >  f2fs: add xattr and acl functionalities
> >  f2fs: add garbage collection functions
> >  f2fs: add recovery routines for roll-forward
> >  f2fs: update Kconfig and Makefile
> > 
> > Documentation/filesystems/00-INDEX |    2 +
> > Documentation/filesystems/f2fs.txt |  314 +++++++
> > fs/Kconfig                         |    1 +
> > fs/Makefile                        |    1 +
> > fs/f2fs/Kconfig                    |   55 ++
> > fs/f2fs/Makefile                   |    6 +
> > fs/f2fs/acl.c                      |  402 ++++++++
> > fs/f2fs/acl.h                      |   57 ++
> > fs/f2fs/checkpoint.c               |  791 ++++++++++++++++
> > fs/f2fs/data.c                     |  700 ++++++++++++++
> > fs/f2fs/dir.c                      |  657 +++++++++++++
> > fs/f2fs/f2fs.h                     |  981 ++++++++++++++++++++
> > fs/f2fs/file.c                     |  643 +++++++++++++
> > fs/f2fs/gc.c                       | 1140 +++++++++++++++++++++++
> > fs/f2fs/gc.h                       |  203 +++++
> > fs/f2fs/hash.c                     |   98 ++
> > fs/f2fs/inode.c                    |  258 ++++++
> > fs/f2fs/namei.c                    |  549 +++++++++++
> > fs/f2fs/node.c                     | 1773 ++++++++++++++++++++++++++++++++++++
> > fs/f2fs/node.h                     |  331 +++++++
> > fs/f2fs/recovery.c                 |  372 ++++++++
> > fs/f2fs/segment.c                  | 1755 +++++++++++++++++++++++++++++++++++
> > fs/f2fs/segment.h                  |  627 +++++++++++++
> > fs/f2fs/super.c                    |  550 +++++++++++
> > fs/f2fs/xattr.c                    |  387 ++++++++
> > fs/f2fs/xattr.h                    |  142 +++
> > include/linux/f2fs_fs.h            |  359 ++++++++
> > 27 files changed, 13154 insertions(+)
> > create mode 100644 Documentation/filesystems/f2fs.txt
> > create mode 100644 fs/f2fs/Kconfig
> > create mode 100644 fs/f2fs/Makefile
> > create mode 100644 fs/f2fs/acl.c
> > create mode 100644 fs/f2fs/acl.h
> > create mode 100644 fs/f2fs/checkpoint.c
> > create mode 100644 fs/f2fs/data.c
> > create mode 100644 fs/f2fs/dir.c
> > create mode 100644 fs/f2fs/f2fs.h
> > create mode 100644 fs/f2fs/file.c
> > create mode 100644 fs/f2fs/gc.c
> > create mode 100644 fs/f2fs/gc.h
> > create mode 100644 fs/f2fs/hash.c
> > create mode 100644 fs/f2fs/inode.c
> > create mode 100644 fs/f2fs/namei.c
> > create mode 100644 fs/f2fs/node.c
> > create mode 100644 fs/f2fs/node.h
> > create mode 100644 fs/f2fs/recovery.c
> > create mode 100644 fs/f2fs/segment.c
> > create mode 100644 fs/f2fs/segment.h
> > create mode 100644 fs/f2fs/super.c
> > create mode 100644 fs/f2fs/xattr.c
> > create mode 100644 fs/f2fs/xattr.h
> > create mode 100644 include/linux/f2fs_fs.h
> > 
> > -- 
> > 1.7.9.5
> > 
> > 
> > 
> > 
> > ---
> > Jaegeuk Kim
> > Samsung
> > 
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  
> > http://vger.kernel.org/majordomo-info.html
> > 
> > Please read the FAQ at http://www.tux.org/lkml/
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Jaegeuk Kim
Samsung


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-06 20:06   ` Jaegeuk Kim
@ 2012-10-07  7:09     ` Marco Stornelli
  2012-10-07  9:31         ` Jaegeuk Kim
  2012-10-07 10:15       ` Vyacheslav Dubeyko
  1 sibling, 1 reply; 154+ messages in thread
From: Marco Stornelli @ 2012-10-07  7:09 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Vyacheslav Dubeyko, jaegeuk.kim, Al Viro, tytso, gregkh,
	linux-kernel, chur.lee, cm224.lee, jooyoung.hwang, linux-fsdevel

On 06/10/2012 22:06, Jaegeuk Kim wrote:
> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
>> Hi Jaegeuk,
>
> Hi.
> We know each other, right? :)
>
>>
>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>, gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com, cm224.lee@samsung.com, jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
>>>
>>> This is a new patch set for the f2fs file system.
>>>
>>> What is F2FS?
>>> =============
>>>
>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
>>> been widely being used for ranging from mobile to server systems. Since they are
>>> known to have different characteristics from the conventional rotational disks,
>>> a file system, an upper layer to the storage device, should adapt to the changes
>>> from the sketch.
>>>
>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
>>> devices. We chose a log structure file system approach, but we tried to adapt it
>>> to the new form of storage. Also we remedy some known issues of the very old log
>>> structured file system, such as snowball effect of wandering tree and high cleaning
>>> overhead.
>>>
>>> Because a NAND-based storage device shows different characteristics according to
>>> its internal geometry or flash memory management scheme aka FTL, we add various
>>> parameters not only for configuring on-disk layout, but also for selecting allocation
>>> and cleaning algorithms.
>>>
>>
>> What about F2FS performance? Could you share benchmarking results of the new file system?
>>
>> It is very interesting the case of aged file system. How is GC's implementation efficient? Could you share benchmarking results for the very aged file system state?
>>
>
> Although I have benchmark results, currently I'd like to see the results
> measured by community as a black-box. As you know, the results are very
> dependent on the workloads and parameters, so I think it would be better
> to see other results for a while.
> Thanks,
>

1) Actually, it's a strange approach. If you have any results, you should
share them with the community, explaining how your benchmark works (the
workload, hardware and so on) and the specific conditions. I really don't
like the approach of "I've got the results but I won't say anything; if
you want a number, do it yourself".
2) For a new filesystem you should send the patches to linux-fsdevel.
3) The pros and cons of your filesystem are not clear; can you share with
us the main differences from the filesystems already in mainline? Or is
that a company secret?

Marco

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-07  7:09     ` Marco Stornelli
@ 2012-10-07  9:31         ` Jaegeuk Kim
  0 siblings, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-07  9:31 UTC (permalink / raw)
  To: 'Marco Stornelli', 'Jaegeuk Kim'
  Cc: 'Vyacheslav Dubeyko', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> -----Original Message-----
> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> Sent: Sunday, October 07, 2012 4:10 PM
> To: Jaegeuk Kim
> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tytso@mit.edu; gregkh@linuxfoundation.org;
> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> >> Hi Jaegeuk,
> >
> > Hi.
> > We know each other, right? :)
> >
> >>
> >>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> >>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com, cm224.lee@samsung.com,
> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> >>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> >>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> >>>
> >>> This is a new patch set for the f2fs file system.
> >>>
> >>> What is F2FS?
> >>> =============
> >>>
> >>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> >>> been widely being used for ranging from mobile to server systems. Since they are
> >>> known to have different characteristics from the conventional rotational disks,
> >>> a file system, an upper layer to the storage device, should adapt to the changes
> >>> from the sketch.
> >>>
> >>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
> >>> devices. We chose a log structure file system approach, but we tried to adapt it
> >>> to the new form of storage. Also we remedy some known issues of the very old log
> >>> structured file system, such as snowball effect of wandering tree and high cleaning
> >>> overhead.
> >>>
> >>> Because a NAND-based storage device shows different characteristics according to
> >>> its internal geometry or flash memory management scheme aka FTL, we add various
> >>> parameters not only for configuring on-disk layout, but also for selecting allocation
> >>> and cleaning algorithms.
> >>>
> >>
> >> What about F2FS performance? Could you share benchmarking results of the new file system?
> >>
> >> It is very interesting the case of aged file system. How is GC's implementation efficient? Could
> you share benchmarking results for the very aged file system state?
> >>
> >
> > Although I have benchmark results, currently I'd like to see the results
> > measured by community as a black-box. As you know, the results are very
> > dependent on the workloads and parameters, so I think it would be better
> > to see other results for a while.
> > Thanks,
> >
> 
> 1) Actually it's a strange approach. If you have got any results you
> should share them with the community explaining how (the workload, hw
> and so on) your benchmark works and the specific condition. I really
> don't like the approach "I've got the results but I don't say anything,
> if you want a number, do it yourself".

That's definitely right, and I did mean *for a while*.
I just wanted to avoid arguing about how to age the file system at this time.
In the meantime, let me share some preliminary results:

1. iozone in Panda board
 - ARM A9
 - DRAM : 1GB
 - Kernel: Linux 3.3
 - Partition: 12GB (64GB Samsung eMMC)
 - Tested on 2GB file

           seq. read, seq. write, rand. read, rand. write
 - ext4:    30.753         17.066       5.06         4.15
 - f2fs:    30.71          16.906       5.073       15.204

2. iozone in Galaxy Nexus
 - DRAM : 1GB
 - Android 4.0.4_r1.2
 - Kernel omap 3.0.8
 - Partition: /data, 12GB
 - Tested on 2GB file

           seq. read, seq. write, rand. read,  rand. write
 - ext4:    29.88        12.83         11.43          0.56
 - f2fs:    29.70        13.34         10.79         12.82
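
(The exact iozone command was not given here; as an illustration only, with
the flags and target path being assumptions, a 2GB run covering the four
columns above might be invoked as:)

 # iozone -e -I -s 2g -r 4k -i 0 -i 1 -i 2 -f /data/iozone.tmp  # illustrative
   (-s 2g / -r 4k: file and record size, -i 0/1/2: sequential write/read and
    random read/write tests, -e: include flush time, -I: use direct I/O)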

Due to company confidentiality, I expect to show further results after presenting f2fs at the Korea Linux Forum.

> 2) For a new filesystem you should send the patches to linux-fsdevel.

Yes, that was totally my mistake.

> 3) It's not clear the pros/cons of your filesystem, can you share with
> us the main differences with the current fs already in mainline? Or is
> it a company secret?

After the forum, I can share the slides, and I hope they will be useful to you.

Instead, let me summarize F2FS at a glance compared with other file systems.
There are several log-structured file systems.
Note that F2FS operates on top of a block device while taking the FTL behavior into consideration.
So JFFS2, YAFFS2, and UBIFS are out of scope, since they are designed for raw NAND flash.
LogFS was initially designed for raw NAND flash, but was later expanded to block devices.
However, I don't know whether it is stable or not.
NILFS2 is one of the major log-structured file systems, and it supports multiple snapshots.
IMO, that feature is quite promising and important to users, but it may degrade performance.
There is a trade-off between functionality and performance.
F2FS chose high performance without any further fancy functionality.

It may well be possible to optimize ext4 or btrfs for flash storage.
IMHO, however, they were originally designed for HDDs, so they may or may not suffer from their fundamental designs.
So why not design a new file system for flash storage as a counterpart?

> 
> Marco

---
Jaegeuk Kim
Samsung


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-06 20:06   ` Jaegeuk Kim
@ 2012-10-07 10:15       ` Vyacheslav Dubeyko
  2012-10-07 10:15       ` Vyacheslav Dubeyko
  1 sibling, 0 replies; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-07 10:15 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: jaegeuk.kim, Al Viro, tytso, gregkh, linux-kernel, chur.lee,
	cm224.lee, jooyoung.hwang, linux-fsdevel


Hi,

On Oct 7, 2012, at 12:06 AM, Jaegeuk Kim wrote:

> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
>> Hi Jaegeuk,
> 
> Hi.
> We know each other, right? :)
> 

Yes, you are correct. :-)

>> 
>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>, gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com, cm224.lee@samsung.com, jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
>>> 
>>> This is a new patch set for the f2fs file system.
>>> 
>>> What is F2FS?
>>> =============
>>> 
>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
>>> been widely being used for ranging from mobile to server systems. Since they are
>>> known to have different characteristics from the conventional rotational disks,
>>> a file system, an upper layer to the storage device, should adapt to the changes
>>> from the sketch.
>>> 
>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
>>> devices. We chose a log structure file system approach, but we tried to adapt it
>>> to the new form of storage. Also we remedy some known issues of the very old log
>>> structured file system, such as snowball effect of wandering tree and high cleaning
>>> overhead.
>>> 
>>> Because a NAND-based storage device shows different characteristics according to
>>> its internal geometry or flash memory management scheme aka FTL, we add various
>>> parameters not only for configuring on-disk layout, but also for selecting allocation
>>> and cleaning algorithms.
>>> 
>> 
>> What about F2FS performance? Could you share benchmarking results of the new file system?
>> 
>> It is very interesting the case of aged file system. How is GC's implementation efficient? Could you share benchmarking results for the very aged file system state?
>> 
> 
> Although I have benchmark results, currently I'd like to see the results
> measured by community as a black-box. As you know, the results are very
> dependent on the workloads and parameters, so I think it would be better
> to see other results for a while.
> Thanks,

It is a good strategy. But there are known bottlenecks, and maybe it makes sense to begin discussing them in the community.

With the best regards,
Vyacheslav Dubeyko.

> 
>> With the best regards,
>> Vyacheslav Dubeyko.
>> 
>>> Patch set
>>> =========
>>> 
>>> The patch #1 adds a document to Documentation/filesystems/.
>>> The patch #2 adds a header file of on-disk layout to include/linux/.
>>> The patches #3-#15 adds f2fs source files to fs/f2fs/.
>>> The Last patch, patch #16, updates Makefile and Kconfig.
>>> 
>>> mkfs.f2fs
>>> =========
>>> 
>>> The file system formatting tool, "mkfs.f2fs", is available from the following
>>> download page:		
>>> http://sourceforge.net/projects/f2fs-tools/
>>> 
>>> 
>>> Usage
>>> =====
>>> 
>>> If you'd like to experience f2fs, simply:
>>> # mkfs.f2fs /dev/sdb1
>>> # mount -t f2fs /dev/sdb1 /mnt/f2fs
>>> 
>>> Short log
>>> =========
>>> 
>>> Jaegeuk Kim (16):
>>> f2fs: add document
>>> f2fs: add on-disk layout
>>> f2fs: add superblock and major in-memory structure
>>> f2fs: add super block operations
>>> f2fs: add checkpoint operations
>>> f2fs: add node operations
>>> f2fs: add segment operations
>>> f2fs: add file operations
>>> f2fs: add address space operations for data
>>> f2fs: add core inode operations
>>> f2fs: add inode operations for special inodes
>>> f2fs: add core directory operations
>>> f2fs: add xattr and acl functionalities
>>> f2fs: add garbage collection functions
>>> f2fs: add recovery routines for roll-forward
>>> f2fs: update Kconfig and Makefile
>>> 
>>> Documentation/filesystems/00-INDEX |    2 +
>>> Documentation/filesystems/f2fs.txt |  314 +++++++
>>> fs/Kconfig                         |    1 +
>>> fs/Makefile                        |    1 +
>>> fs/f2fs/Kconfig                    |   55 ++
>>> fs/f2fs/Makefile                   |    6 +
>>> fs/f2fs/acl.c                      |  402 ++++++++
>>> fs/f2fs/acl.h                      |   57 ++
>>> fs/f2fs/checkpoint.c               |  791 ++++++++++++++++
>>> fs/f2fs/data.c                     |  700 ++++++++++++++
>>> fs/f2fs/dir.c                      |  657 +++++++++++++
>>> fs/f2fs/f2fs.h                     |  981 ++++++++++++++++++++
>>> fs/f2fs/file.c                     |  643 +++++++++++++
>>> fs/f2fs/gc.c                       | 1140 +++++++++++++++++++++++
>>> fs/f2fs/gc.h                       |  203 +++++
>>> fs/f2fs/hash.c                     |   98 ++
>>> fs/f2fs/inode.c                    |  258 ++++++
>>> fs/f2fs/namei.c                    |  549 +++++++++++
>>> fs/f2fs/node.c                     | 1773 ++++++++++++++++++++++++++++++++++++
>>> fs/f2fs/node.h                     |  331 +++++++
>>> fs/f2fs/recovery.c                 |  372 ++++++++
>>> fs/f2fs/segment.c                  | 1755 +++++++++++++++++++++++++++++++++++
>>> fs/f2fs/segment.h                  |  627 +++++++++++++
>>> fs/f2fs/super.c                    |  550 +++++++++++
>>> fs/f2fs/xattr.c                    |  387 ++++++++
>>> fs/f2fs/xattr.h                    |  142 +++
>>> include/linux/f2fs_fs.h            |  359 ++++++++
>>> 27 files changed, 13154 insertions(+)
>>> create mode 100644 Documentation/filesystems/f2fs.txt
>>> create mode 100644 fs/f2fs/Kconfig
>>> create mode 100644 fs/f2fs/Makefile
>>> create mode 100644 fs/f2fs/acl.c
>>> create mode 100644 fs/f2fs/acl.h
>>> create mode 100644 fs/f2fs/checkpoint.c
>>> create mode 100644 fs/f2fs/data.c
>>> create mode 100644 fs/f2fs/dir.c
>>> create mode 100644 fs/f2fs/f2fs.h
>>> create mode 100644 fs/f2fs/file.c
>>> create mode 100644 fs/f2fs/gc.c
>>> create mode 100644 fs/f2fs/gc.h
>>> create mode 100644 fs/f2fs/hash.c
>>> create mode 100644 fs/f2fs/inode.c
>>> create mode 100644 fs/f2fs/namei.c
>>> create mode 100644 fs/f2fs/node.c
>>> create mode 100644 fs/f2fs/node.h
>>> create mode 100644 fs/f2fs/recovery.c
>>> create mode 100644 fs/f2fs/segment.c
>>> create mode 100644 fs/f2fs/segment.h
>>> create mode 100644 fs/f2fs/super.c
>>> create mode 100644 fs/f2fs/xattr.c
>>> create mode 100644 fs/f2fs/xattr.h
>>> create mode 100644 include/linux/f2fs_fs.h
>>> 
>>> -- 
>>> 1.7.9.5
>>> 
>>> 
>>> 
>>> 
>>> ---
>>> Jaegeuk Kim
>>> Samsung
>>> 
>>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  
>>> http://vger.kernel.org/majordomo-info.html
>>> 
>>> Please read the FAQ at http://www.tux.org/lkml/
>>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
> 
> -- 
> Jaegeuk Kim
> Samsung
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-07  9:31         ` Jaegeuk Kim
@ 2012-10-07 12:08           ` Vyacheslav Dubeyko
  -1 siblings, 0 replies; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-07 12:08 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

Hi,

On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:

>> -----Original Message-----
>> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
>> Sent: Sunday, October 07, 2012 4:10 PM
>> To: Jaegeuk Kim
>> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tytso@mit.edu; gregkh@linuxfoundation.org;
>> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
>> linux-fsdevel@vger.kernel.org
>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>> 
>> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
>>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
>>>> Hi Jaegeuk,
>>> 
>>> Hi.
>>> We know each other, right? :)
>>> 
>>>> 
>>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
>>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
>> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com, cm224.lee@samsung.com,
>> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
>>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
>>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
>>>>> 
>>>>> This is a new patch set for the f2fs file system.
>>>>> 
>>>>> What is F2FS?
>>>>> =============
>>>>> 
>>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
>>>>> been widely being used for ranging from mobile to server systems. Since they are
>>>>> known to have different characteristics from the conventional rotational disks,
>>>>> a file system, an upper layer to the storage device, should adapt to the changes
>>>>> from the sketch.
>>>>> 
>>>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
>>>>> devices. We chose a log structure file system approach, but we tried to adapt it
>>>>> to the new form of storage. Also we remedy some known issues of the very old log
>>>>> structured file system, such as snowball effect of wandering tree and high cleaning
>>>>> overhead.
>>>>> 
>>>>> Because a NAND-based storage device shows different characteristics according to
>>>>> its internal geometry or flash memory management scheme aka FTL, we add various
>>>>> parameters not only for configuring on-disk layout, but also for selecting allocation
>>>>> and cleaning algorithms.
>>>>> 
>>>> 
>>>> What about F2FS performance? Could you share benchmarking results of the new file system?
>>>> 
>>>> It is very interesting the case of aged file system. How is GC's implementation efficient? Could
>> you share benchmarking results for the very aged file system state?
>>>> 
>>> 
>>> Although I have benchmark results, currently I'd like to see the results
>>> measured by community as a black-box. As you know, the results are very
>>> dependent on the workloads and parameters, so I think it would be better
>>> to see other results for a while.
>>> Thanks,
>>> 
>> 
>> 1) Actually it's a strange approach. If you have got any results you
>> should share them with the community explaining how (the workload, hw
>> and so on) your benchmark works and the specific condition. I really
>> don't like the approach "I've got the results but I don't say anything,
>> if you want a number, do it yourself".
> 
> It's definitely right, and I meant *for a while*.
> I just wanted to avoid arguing with how to age file system in this time.
> Before then, I share the primitive results as follows.
> 
> 1. iozone in Panda board
> - ARM A9
> - DRAM : 1GB
> - Kernel: Linux 3.3
> - Partition: 12GB (64GB Samsung eMMC)
> - Tested on 2GB file
> 
>           seq. read, seq. write, rand. read, rand. write
> - ext4:    30.753         17.066       5.06         4.15
> - f2fs:    30.71          16.906       5.073       15.204
> 
> 2. iozone in Galaxy Nexus
> - DRAM : 1GB
> - Android 4.0.4_r1.2
> - Kernel omap 3.0.8
> - Partition: /data, 12GB
> - Tested on 2GB file
> 
>           seq. read, seq. write, rand. read,  rand. write
> - ext4:    29.88        12.83         11.43          0.56
> - f2fs:    29.70        13.34         10.79         12.82
> 


These are results for a non-aged filesystem state. Am I correct?
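
(For instance, one crude way to pre-age the partition before re-running the
benchmark, with the mount point, sizes, and iteration count being purely
illustrative assumptions, would be to repeatedly fill it and delete part of
the files so that free space becomes fragmented:)

 # illustrative aging loop, not taken from this thread
 for i in $(seq 1 20); do
     dd if=/dev/urandom of=/mnt/f2fs/fill_$i bs=1M count=512
     if [ $((i % 3)) -eq 0 ]; then rm -f /mnt/f2fs/fill_$((i - 2)); fi
 done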


> Due to the company secret, I expect to show other results after presenting f2fs at korea linux forum.
> 
>> 2) For a new filesystem you should send the patches to linux-fsdevel.
> 
> Yes, that was totally my mistake.
> 
>> 3) It's not clear the pros/cons of your filesystem, can you share with
>> us the main differences with the current fs already in mainline? Or is
>> it a company secret?
> 
> After forum, I can share the slides, and I hope they will be useful to you.
> 
> Instead, let me summarize at a glance compared with other file systems.
> Here are several log-structured file systems.
> Note that, F2FS operates on top of block device with consideration on the FTL behavior.
> So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed for raw NAND flash.
> LogFS is initially designed for raw NAND flash, but expanded to block device.
> But, I don't know whether it is stable or not.
> NILFS2 is one of major log-structured file systems, which supports multiple snap-shots.
> IMO, that feature is quite promising and important to users, but it may degrade the performance.
> There is a trade-off between functionalities and performance.
> F2FS chose high performance without any further fancy functionalities.
> 

Performance is a good goal. But fault tolerance is also a very important point. Filesystems are used by users, so it is very important to guarantee the reliability of data keeping. Performance degradation caused by snapshots is an arguable point. Snapshots can protect not only against unpredictable environmental issues but also against users' erroneous behavior.

As I understand it, it is not possible to have perfect performance under all possible workloads. Could you point out which workloads F2FS is best suited for?

> Maybe or obviously it is possible to optimize ext4 or btrfs to flash storages.
> IMHO, however, they are originally designed for HDDs, so that it may or may not suffer from fundamental designs.
> I don't know, but why not designing a new file system for flash storages as a counterpart?
> 

Yes, it is possible. But F2FS is not a flash-oriented filesystem like JFFS2, YAFFS2, or UBIFS; it is a block-oriented filesystem. So the F2FS design is restricted by what the block layer exposes of the flash storage's peculiarities. Could you point out the key points of the F2FS design that make it fundamentally unique?
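
(For example, the flash-related information that a file system on a block
device can query is largely limited to a few request-queue attributes such
as the ones below; the device name here is an assumption, not from this
thread:)

 # cat /sys/block/mmcblk0/queue/rotational
 # cat /sys/block/mmcblk0/queue/discard_granularity
 # cat /sys/block/mmcblk0/queue/discard_max_bytes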

With the best regards,
Vyacheslav Dubeyko.


>> 
>> Marco
> 
> ---
> Jaegeuk Kim
> Samsung
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
@ 2012-10-07 12:08           ` Vyacheslav Dubeyko
  0 siblings, 0 replies; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-07 12:08 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

Hi,

On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:

>> -----Original Message-----
>> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
>> Sent: Sunday, October 07, 2012 4:10 PM
>> To: Jaegeuk Kim
>> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tytso@mit.edu; gregkh@linuxfoundation.org;
>> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
>> linux-fsdevel@vger.kernel.org
>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>> 
>> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
>>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
>>>> Hi Jaegeuk,
>>> 
>>> Hi.
>>> We know each other, right? :)
>>> 
>>>> 
>>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
>>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
>> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com, cm224.lee@samsung.com,
>> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
>>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
>>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
>>>>> 
>>>>> This is a new patch set for the f2fs file system.
>>>>> 
>>>>> What is F2FS?
>>>>> =============
>>>>> 
>>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
>>>>> been widely being used for ranging from mobile to server systems. Since they are
>>>>> known to have different characteristics from the conventional rotational disks,
>>>>> a file system, an upper layer to the storage device, should adapt to the changes
>>>>> from the sketch.
>>>>> 
>>>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
>>>>> devices. We chose a log structure file system approach, but we tried to adapt it
>>>>> to the new form of storage. Also we remedy some known issues of the very old log
>>>>> structured file system, such as snowball effect of wandering tree and high cleaning
>>>>> overhead.
>>>>> 
>>>>> Because a NAND-based storage device shows different characteristics according to
>>>>> its internal geometry or flash memory management scheme aka FTL, we add various
>>>>> parameters not only for configuring on-disk layout, but also for selecting allocation
>>>>> and cleaning algorithms.
>>>>> 
>>>> 
>>>> What about F2FS performance? Could you share benchmarking results of the new file system?
>>>> 
>>>> It is very interesting the case of aged file system. How is GC's implementation efficient? Could
>> you share benchmarking results for the very aged file system state?
>>>> 
>>> 
>>> Although I have benchmark results, currently I'd like to see the results
>>> measured by community as a black-box. As you know, the results are very
>>> dependent on the workloads and parameters, so I think it would be better
>>> to see other results for a while.
>>> Thanks,
>>> 
>> 
>> 1) Actually it's a strange approach. If you have got any results you
>> should share them with the community explaining how (the workload, hw
>> and so on) your benchmark works and the specific condition. I really
>> don't like the approach "I've got the results but I don't say anything,
>> if you want a number, do it yourself".
> 
> It's definitely right, and I meant *for a while*.
> I just wanted to avoid arguing with how to age file system in this time.
> Before then, I share the primitive results as follows.
> 
> 1. iozone in Panda board
> - ARM A9
> - DRAM : 1GB
> - Kernel: Linux 3.3
> - Partition: 12GB (64GB Samsung eMMC)
> - Tested on 2GB file
> 
>           seq. read, seq. write, rand. read, rand. write
> - ext4:    30.753         17.066       5.06         4.15
> - f2fs:    30.71          16.906       5.073       15.204
> 
> 2. iozone in Galaxy Nexus
> - DRAM : 1GB
> - Android 4.0.4_r1.2
> - Kernel omap 3.0.8
> - Partition: /data, 12GB
> - Tested on 2GB file
> 
>           seq. read, seq. write, rand. read,  rand. write
> - ext4:    29.88        12.83         11.43          0.56
> - f2fs:    29.70        13.34         10.79         12.82
> 


This is results for non-aged filesystem state. Am I correct?


> Due to the company secret, I expect to show other results after presenting f2fs at korea linux forum.
> 
>> 2) For a new filesystem you should send the patches to linux-fsdevel.
> 
> Yes, that was totally my mistake.
> 
>> 3) It's not clear the pros/cons of your filesystem, can you share with
>> us the main differences with the current fs already in mainline? Or is
>> it a company secret?
> 
> After forum, I can share the slides, and I hope they will be useful to you.
> 
> Instead, let me summarize at a glance compared with other file systems.
> Here are several log-structured file systems.
> Note that, F2FS operates on top of block device with consideration on the FTL behavior.
> So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed for raw NAND flash.
> LogFS is initially designed for raw NAND flash, but expanded to block device.
> But, I don't know whether it is stable or not.
> NILFS2 is one of major log-structured file systems, which supports multiple snap-shots.
> IMO, that feature is quite promising and important to users, but it may degrade the performance.
> There is a trade-off between functionalities and performance.
> F2FS chose high performance without any further fancy functionalities.
> 

Performance is a good goal. But fault-tolerance is also a very important point. Filesystems are used by users, so it is very important to guarantee the reliability of the stored data. Whether snapshots really degrade performance is an arguable point. Snapshots can protect not only against unpredictable environmental issues but also against a user's own erroneous behavior.

As I understand it, it is not possible to have perfect performance under all possible workloads. Could you point out which workloads F2FS is best suited for?

> Maybe or obviously it is possible to optimize ext4 or btrfs to flash storages.
> IMHO, however, they are originally designed for HDDs, so that it may or may not suffer from fundamental designs.
> I don't know, but why not designing a new file system for flash storages as a counterpart?
> 

Yes, it is possible. But F2FS is not a flash-oriented filesystem like JFFS2, YAFFS2, or UBIFS; it is a block-oriented filesystem. So the F2FS design is restricted to whatever the block layer exposes of the flash storage's peculiarities. Could you point out the key points of the F2FS design that make it fundamentally unique?

With the best regards,
Vyacheslav Dubeyko.


>> 
>> Marco
> 
> ---
> Jaegeuk Kim
> Samsung
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-07 12:08           ` Vyacheslav Dubeyko
@ 2012-10-08  8:25             ` Jaegeuk Kim
  -1 siblings, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-08  8:25 UTC (permalink / raw)
  To: 'Vyacheslav Dubeyko'
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> -----Original Message-----
> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> Sent: Sunday, October 07, 2012 9:09 PM
> To: Jaegeuk Kim
> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> Hi,
> 
> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> 
> >> -----Original Message-----
> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> >> Sent: Sunday, October 07, 2012 4:10 PM
> >> To: Jaegeuk Kim
> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tytso@mit.edu; gregkh@linuxfoundation.org;
> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> jooyoung.hwang@samsung.com;
> >> linux-fsdevel@vger.kernel.org
> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >>
> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> >>>> Hi Jaegeuk,
> >>>
> >>> Hi.
> >>> We know each other, right? :)
> >>>
> >>>>
> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com,
> cm224.lee@samsung.com,
> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> >>>>>
> >>>>> This is a new patch set for the f2fs file system.
> >>>>>
> >>>>> What is F2FS?
> >>>>> =============
> >>>>>
> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> >>>>> been widely being used for ranging from mobile to server systems. Since they are
> >>>>> known to have different characteristics from the conventional rotational disks,
> >>>>> a file system, an upper layer to the storage device, should adapt to the changes
> >>>>> from the sketch.
> >>>>>
> >>>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
> >>>>> devices. We chose a log structure file system approach, but we tried to adapt it
> >>>>> to the new form of storage. Also we remedy some known issues of the very old log
> >>>>> structured file system, such as snowball effect of wandering tree and high cleaning
> >>>>> overhead.
> >>>>>
> >>>>> Because a NAND-based storage device shows different characteristics according to
> >>>>> its internal geometry or flash memory management scheme aka FTL, we add various
> >>>>> parameters not only for configuring on-disk layout, but also for selecting allocation
> >>>>> and cleaning algorithms.
> >>>>>
> >>>>
> >>>> What about F2FS performance? Could you share benchmarking results of the new file system?
> >>>>
> >>>> It is very interesting the case of aged file system. How is GC's implementation efficient? Could
> >> you share benchmarking results for the very aged file system state?
> >>>>
> >>>
> >>> Although I have benchmark results, currently I'd like to see the results
> >>> measured by community as a black-box. As you know, the results are very
> >>> dependent on the workloads and parameters, so I think it would be better
> >>> to see other results for a while.
> >>> Thanks,
> >>>
> >>
> >> 1) Actually it's a strange approach. If you have got any results you
> >> should share them with the community explaining how (the workload, hw
> >> and so on) your benchmark works and the specific condition. I really
> >> don't like the approach "I've got the results but I don't say anything,
> >> if you want a number, do it yourself".
> >
> > It's definitely right, and I meant *for a while*.
> > I just wanted to avoid arguing with how to age file system in this time.
> > Before then, I share the primitive results as follows.
> >
> > 1. iozone in Panda board
> > - ARM A9
> > - DRAM : 1GB
> > - Kernel: Linux 3.3
> > - Partition: 12GB (64GB Samsung eMMC)
> > - Tested on 2GB file
> >
> >           seq. read, seq. write, rand. read, rand. write
> > - ext4:    30.753         17.066       5.06         4.15
> > - f2fs:    30.71          16.906       5.073       15.204
> >
> > 2. iozone in Galaxy Nexus
> > - DRAM : 1GB
> > - Android 4.0.4_r1.2
> > - Kernel omap 3.0.8
> > - Partition: /data, 12GB
> > - Tested on 2GB file
> >
> >           seq. read, seq. write, rand. read,  rand. write
> > - ext4:    29.88        12.83         11.43          0.56
> > - f2fs:    29.70        13.34         10.79         12.82
> >
> 
> 
> This is results for non-aged filesystem state. Am I correct?
> 

Yes, right.

> 
> > Due to the company secret, I expect to show other results after presenting f2fs at korea linux forum.
> >
> >> 2) For a new filesystem you should send the patches to linux-fsdevel.
> >
> > Yes, that was totally my mistake.
> >
> >> 3) It's not clear the pros/cons of your filesystem, can you share with
> >> us the main differences with the current fs already in mainline? Or is
> >> it a company secret?
> >
> > After forum, I can share the slides, and I hope they will be useful to you.
> >
> > Instead, let me summarize at a glance compared with other file systems.
> > Here are several log-structured file systems.
> > Note that, F2FS operates on top of block device with consideration on the FTL behavior.
> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed for raw NAND flash.
> > LogFS is initially designed for raw NAND flash, but expanded to block device.
> > But, I don't know whether it is stable or not.
> > NILFS2 is one of major log-structured file systems, which supports multiple snap-shots.
> > IMO, that feature is quite promising and important to users, but it may degrade the performance.
> > There is a trade-off between functionalities and performance.
> > F2FS chose high performance without any further fancy functionalities.
> >
> 
> Performance is a good goal. But fault-tolerance is also very important point. Filesystems are used by
> users, so, it is very important to guarantee reliability of data keeping. Degradation of performance
> by means of snapshots is arguable point. Snapshots can solve the problem not only some unpredictable
> environmental issues but also user's erroneous behavior.
> 

Yes, I agree. My concern was with the multiple-snapshot feature.
Of course, fault-tolerance is very important, and a file system should support it in the form of power-off recovery.
f2fs supports a recovery mechanism by adopting a checkpoint, which is similar to a snapshot.
But f2fs does not support multiple snapshots for user convenience.
I just focused on performance, and the multiple-snapshot feature is certainly a good alternative approach.
That may be a trade-off.

> As I understand, it is not possible to have a perfect performance in all possible workloads. Could you
> point out what workloads are the best way of F2FS using?

Basically I think the following workloads will be a good fit for F2FS (a rough sketch of the second one is shown below).
- Many random writes : this is the nature of an LFS
- Small writes with frequent fsync : f2fs is optimized to reduce the fsync overhead
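
A minimal sketch of such a small-write-plus-fsync workload, assuming fio is installed and that /mnt/f2fs is an f2fs mount point (both are assumptions for illustration, not taken from the measurements above), could look like:

  fio --name=small-sync-writes --directory=/mnt/f2fs \
      --rw=randwrite --bs=4k --size=256m --fsync=1

The --fsync=1 option issues an fsync() after every 4KB write, which approximates the pattern described above; note that the ext4/f2fs numbers quoted earlier in this thread were produced with iozone, not with this command.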

> 
> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash storages.
> > IMHO, however, they are originally designed for HDDs, so that it may or may not suffer from
> fundamental designs.
> > I don't know, but why not designing a new file system for flash storages as a counterpart?
> >
> 
> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2, YAFFS2, UBIFS but block-
> oriented filesystem. So, F2FS design is restricted by block-layer's opportunities in the using of
> flash storages' peculiarities. Could you point out key points of F2FS design that makes this design
> fundamentally unique?

As you can see in the f2fs kernel documentation patch, I think one of the most important features is aligning the operating units between f2fs and the FTL.
Specifically, f2fs has the section and the zone, which are the cleaning unit and the basic allocation unit respectively.
Through these configurable units, I think f2fs is able to reduce the unnecessary operations done by the FTL.
And, in order to keep the block layer from changing the IO patterns, f2fs merges some bios itself, as ext4 does.
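
As a rough sketch of how these units might be chosen at format time (assuming the -s "segments per section" and -z "sections per zone" options behave as described in the mkfs.f2fs help of your f2fs-tools version, and using a hypothetical device /dev/sdX1):

  mkfs.f2fs -s 4 -z 2 /dev/sdX1     # 4 segments (2MB each) per section, 2 sections per zone
  mount -t f2fs /dev/sdX1 /mnt/f2fs

The idea is to pick a section size that matches the device's garbage-collection unit, so that f2fs cleaning and FTL cleaning operate on the same boundaries.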

> 
> With the best regards,
> Vyacheslav Dubeyko.
> 
> 
> >>
> >> Marco
> >
> > ---
> > Jaegeuk Kim
> > Samsung
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/


---
Jaegeuk Kim
Samsung


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-08  8:25             ` Jaegeuk Kim
@ 2012-10-08  9:59               ` Namjae Jeon
  -1 siblings, 0 replies; 154+ messages in thread
From: Namjae Jeon @ 2012-10-08  9:59 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Vyacheslav Dubeyko, Marco Stornelli, Jaegeuk Kim, Al Viro, tytso,
	gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
>> -----Original Message-----
>> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
>> Sent: Sunday, October 07, 2012 9:09 PM
>> To: Jaegeuk Kim
>> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
>> gregkh@linuxfoundation.org; linux-
>> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
>> jooyoung.hwang@samsung.com;
>> linux-fsdevel@vger.kernel.org
>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>>
>> Hi,
>>
>> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
>>
>> >> -----Original Message-----
>> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
>> >> Sent: Sunday, October 07, 2012 4:10 PM
>> >> To: Jaegeuk Kim
>> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
>> >> tytso@mit.edu; gregkh@linuxfoundation.org;
>> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
>> >> cm224.lee@samsung.com;
>> jooyoung.hwang@samsung.com;
>> >> linux-fsdevel@vger.kernel.org
>> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>> >>
>> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
>> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
>> >>>> Hi Jaegeuk,
>> >>>
>> >>> Hi.
>> >>> We know each other, right? :)
>> >>>
>> >>>>
>> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
>> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
>> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
>> >> chur.lee@samsung.com,
>> cm224.lee@samsung.com,
>> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
>> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
>> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
>> >>>>>
>> >>>>> This is a new patch set for the f2fs file system.
>> >>>>>
>> >>>>> What is F2FS?
>> >>>>> =============
>> >>>>>
>> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD
>> >>>>> cards, have
>> >>>>> been widely being used for ranging from mobile to server systems.
>> >>>>> Since they are
>> >>>>> known to have different characteristics from the conventional
>> >>>>> rotational disks,
>> >>>>> a file system, an upper layer to the storage device, should adapt to
>> >>>>> the changes
>> >>>>> from the sketch.
>> >>>>>
>> >>>>> F2FS is a new file system carefully designed for the NAND flash
>> >>>>> memory-based storage
>> >>>>> devices. We chose a log structure file system approach, but we tried
>> >>>>> to adapt it
>> >>>>> to the new form of storage. Also we remedy some known issues of the
>> >>>>> very old log
>> >>>>> structured file system, such as snowball effect of wandering tree
>> >>>>> and high cleaning
>> >>>>> overhead.
>> >>>>>
>> >>>>> Because a NAND-based storage device shows different characteristics
>> >>>>> according to
>> >>>>> its internal geometry or flash memory management scheme aka FTL, we
>> >>>>> add various
>> >>>>> parameters not only for configuring on-disk layout, but also for
>> >>>>> selecting allocation
>> >>>>> and cleaning algorithms.
>> >>>>>
>> >>>>
>> >>>> What about F2FS performance? Could you share benchmarking results of
>> >>>> the new file system?
>> >>>>
>> >>>> It is very interesting the case of aged file system. How is GC's
>> >>>> implementation efficient? Could
>> >> you share benchmarking results for the very aged file system state?
>> >>>>
>> >>>
>> >>> Although I have benchmark results, currently I'd like to see the
>> >>> results
>> >>> measured by community as a black-box. As you know, the results are
>> >>> very
>> >>> dependent on the workloads and parameters, so I think it would be
>> >>> better
>> >>> to see other results for a while.
>> >>> Thanks,
>> >>>
>> >>
>> >> 1) Actually it's a strange approach. If you have got any results you
>> >> should share them with the community explaining how (the workload, hw
>> >> and so on) your benchmark works and the specific condition. I really
>> >> don't like the approach "I've got the results but I don't say
>> >> anything,
>> >> if you want a number, do it yourself".
>> >
>> > It's definitely right, and I meant *for a while*.
>> > I just wanted to avoid arguing with how to age file system in this
>> > time.
>> > Before then, I share the primitive results as follows.
>> >
>> > 1. iozone in Panda board
>> > - ARM A9
>> > - DRAM : 1GB
>> > - Kernel: Linux 3.3
>> > - Partition: 12GB (64GB Samsung eMMC)
>> > - Tested on 2GB file
>> >
>> >           seq. read, seq. write, rand. read, rand. write
>> > - ext4:    30.753         17.066       5.06         4.15
>> > - f2fs:    30.71          16.906       5.073       15.204
>> >
>> > 2. iozone in Galaxy Nexus
>> > - DRAM : 1GB
>> > - Android 4.0.4_r1.2
>> > - Kernel omap 3.0.8
>> > - Partition: /data, 12GB
>> > - Tested on 2GB file
>> >
>> >           seq. read, seq. write, rand. read,  rand. write
>> > - ext4:    29.88        12.83         11.43          0.56
>> > - f2fs:    29.70        13.34         10.79         12.82
>> >
>>
>>
>> This is results for non-aged filesystem state. Am I correct?
>>
>
> Yes, right.
>
>>
>> > Due to the company secret, I expect to show other results after
>> > presenting f2fs at korea linux forum.
>> >
>> >> 2) For a new filesystem you should send the patches to linux-fsdevel.
>> >
>> > Yes, that was totally my mistake.
>> >
>> >> 3) It's not clear the pros/cons of your filesystem, can you share with
>> >> us the main differences with the current fs already in mainline? Or is
>> >> it a company secret?
>> >
>> > After forum, I can share the slides, and I hope they will be useful to
>> > you.
>> >
>> > Instead, let me summarize at a glance compared with other file systems.
>> > Here are several log-structured file systems.
>> > Note that, F2FS operates on top of block device with consideration on
>> > the FTL behavior.
>> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed
>> > for raw NAND flash.
>> > LogFS is initially designed for raw NAND flash, but expanded to block
>> > device.
>> > But, I don't know whether it is stable or not.
>> > NILFS2 is one of major log-structured file systems, which supports
>> > multiple snap-shots.
>> > IMO, that feature is quite promising and important to users, but it may
>> > degrade the performance.
>> > There is a trade-off between functionalities and performance.
>> > F2FS chose high performance without any further fancy functionalities.
>> >
>>
>> Performance is a good goal. But fault-tolerance is also very important
>> point. Filesystems are used by
>> users, so, it is very important to guarantee reliability of data keeping.
>> Degradation of performance
>> by means of snapshots is arguable point. Snapshots can solve the problem
>> not only some unpredictable
>> environmental issues but also user's erroneous behavior.
>>
>
> Yes, I agree. I concerned the multiple snapshot feature.
> Of course, fault-tolerance is very important, and file system should support
> it as you know as power-off-recovery.
> f2fs supports the recovery mechanism by adopting checkpoint similar to
> snapshot.
> But, f2fs does not support multiple snapshots for user convenience.
> I just focused on the performance, and absolutely, the multiple snapshot
> feature is also a good alternative approach.
> That may be a trade-off.
>
>> As I understand, it is not possible to have a perfect performance in all
>> possible workloads. Could you
>> point out what workloads are the best way of F2FS using?
>
> Basically I think the following workloads will be good for F2FS.
> - Many random writes : it's LFS nature
> - Small writes with frequent fsync : f2fs is optimized to reduce the fsync
> overhead.
>
>>
>> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash
>> > storages.
>> > IMHO, however, they are originally designed for HDDs, so that it may or
>> > may not suffer from
>> fundamental designs.
>> > I don't know, but why not designing a new file system for flash storages
>> > as a counterpart?
>> >
>>
>> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2,
>> YAFFS2, UBIFS but block-
>> oriented filesystem. So, F2FS design is restricted by block-layer's
>> opportunities in the using of
>> flash storages' peculiarities. Could you point out key points of F2FS
>> design that makes this design
>> fundamentally unique?
>
> As you can see the f2fs kernel document patch, I think one of the most
> important features is to align operating units between f2fs and ftl.
> Specifically, f2fs has section and zone, which are cleaning unit and basic
> allocation unit respectively.
> Through these configurable units in f2fs, I think f2fs is able to reduce the
> unnecessary operations done by FTL.
> And, in order to avoid changing IO patterns by the block-layer, f2fs merges
> itself some bios likewise ext4.
Hello.
The internals of eMMC and SSD devices are a black box from the user's side.
How does a normal user easily set the operating-unit alignment (page size and
physical block size?) between f2fs and the FTL in the storage device?

Thanks.

>
>>
>> With the best regards,
>> Vyacheslav Dubeyko.
>>
>>
>> >>
>> >> Marco
>> >
>> > ---
>> > Jaegeuk Kim
>> > Samsung
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>> > in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > Please read the FAQ at  http://www.tux.org/lkml/
>
>
> ---
> Jaegeuk Kim
> Samsung
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-08  9:59               ` Namjae Jeon
  (?)
@ 2012-10-08 10:52               ` Jaegeuk Kim
  2012-10-08 11:21                 ` Namjae Jeon
  2012-10-09  8:31                 ` Lukáš Czerner
  -1 siblings, 2 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-08 10:52 UTC (permalink / raw)
  To: 'Namjae Jeon'
  Cc: 'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> -----Original Message-----
> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> Sent: Monday, October 08, 2012 7:00 PM
> To: Jaegeuk Kim
> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; tytso@mit.edu;
> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> >> -----Original Message-----
> >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> >> Sent: Sunday, October 07, 2012 9:09 PM
> >> To: Jaegeuk Kim
> >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> >> gregkh@linuxfoundation.org; linux-
> >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> >> jooyoung.hwang@samsung.com;
> >> linux-fsdevel@vger.kernel.org
> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >>
> >> Hi,
> >>
> >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> >>
> >> >> -----Original Message-----
> >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> >> >> Sent: Sunday, October 07, 2012 4:10 PM
> >> >> To: Jaegeuk Kim
> >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
> >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
> >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> >> >> cm224.lee@samsung.com;
> >> jooyoung.hwang@samsung.com;
> >> >> linux-fsdevel@vger.kernel.org
> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >> >>
> >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> >> >>>> Hi Jaegeuk,
> >> >>>
> >> >>> Hi.
> >> >>> We know each other, right? :)
> >> >>>
> >> >>>>
> >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> >> >> chur.lee@samsung.com,
> >> cm224.lee@samsung.com,
> >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> >> >>>>>
> >> >>>>> This is a new patch set for the f2fs file system.
> >> >>>>>
> >> >>>>> What is F2FS?
> >> >>>>> =============
> >> >>>>>
> >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD
> >> >>>>> cards, have
> >> >>>>> been widely being used for ranging from mobile to server systems.
> >> >>>>> Since they are
> >> >>>>> known to have different characteristics from the conventional
> >> >>>>> rotational disks,
> >> >>>>> a file system, an upper layer to the storage device, should adapt to
> >> >>>>> the changes
> >> >>>>> from the sketch.
> >> >>>>>
> >> >>>>> F2FS is a new file system carefully designed for the NAND flash
> >> >>>>> memory-based storage
> >> >>>>> devices. We chose a log structure file system approach, but we tried
> >> >>>>> to adapt it
> >> >>>>> to the new form of storage. Also we remedy some known issues of the
> >> >>>>> very old log
> >> >>>>> structured file system, such as snowball effect of wandering tree
> >> >>>>> and high cleaning
> >> >>>>> overhead.
> >> >>>>>
> >> >>>>> Because a NAND-based storage device shows different characteristics
> >> >>>>> according to
> >> >>>>> its internal geometry or flash memory management scheme aka FTL, we
> >> >>>>> add various
> >> >>>>> parameters not only for configuring on-disk layout, but also for
> >> >>>>> selecting allocation
> >> >>>>> and cleaning algorithms.
> >> >>>>>
> >> >>>>
> >> >>>> What about F2FS performance? Could you share benchmarking results of
> >> >>>> the new file system?
> >> >>>>
> >> >>>> It is very interesting the case of aged file system. How is GC's
> >> >>>> implementation efficient? Could
> >> >> you share benchmarking results for the very aged file system state?
> >> >>>>
> >> >>>
> >> >>> Although I have benchmark results, currently I'd like to see the
> >> >>> results
> >> >>> measured by community as a black-box. As you know, the results are
> >> >>> very
> >> >>> dependent on the workloads and parameters, so I think it would be
> >> >>> better
> >> >>> to see other results for a while.
> >> >>> Thanks,
> >> >>>
> >> >>
> >> >> 1) Actually it's a strange approach. If you have got any results you
> >> >> should share them with the community explaining how (the workload, hw
> >> >> and so on) your benchmark works and the specific condition. I really
> >> >> don't like the approach "I've got the results but I don't say
> >> >> anything,
> >> >> if you want a number, do it yourself".
> >> >
> >> > It's definitely right, and I meant *for a while*.
> >> > I just wanted to avoid arguing with how to age file system in this
> >> > time.
> >> > Before then, I share the primitive results as follows.
> >> >
> >> > 1. iozone in Panda board
> >> > - ARM A9
> >> > - DRAM : 1GB
> >> > - Kernel: Linux 3.3
> >> > - Partition: 12GB (64GB Samsung eMMC)
> >> > - Tested on 2GB file
> >> >
> >> >           seq. read, seq. write, rand. read, rand. write
> >> > - ext4:    30.753         17.066       5.06         4.15
> >> > - f2fs:    30.71          16.906       5.073       15.204
> >> >
> >> > 2. iozone in Galaxy Nexus
> >> > - DRAM : 1GB
> >> > - Android 4.0.4_r1.2
> >> > - Kernel omap 3.0.8
> >> > - Partition: /data, 12GB
> >> > - Tested on 2GB file
> >> >
> >> >           seq. read, seq. write, rand. read,  rand. write
> >> > - ext4:    29.88        12.83         11.43          0.56
> >> > - f2fs:    29.70        13.34         10.79         12.82
> >> >
> >>
> >>
> >> This is results for non-aged filesystem state. Am I correct?
> >>
> >
> > Yes, right.
> >
> >>
> >> > Due to the company secret, I expect to show other results after
> >> > presenting f2fs at korea linux forum.
> >> >
> >> >> 2) For a new filesystem you should send the patches to linux-fsdevel.
> >> >
> >> > Yes, that was totally my mistake.
> >> >
> >> >> 3) It's not clear the pros/cons of your filesystem, can you share with
> >> >> us the main differences with the current fs already in mainline? Or is
> >> >> it a company secret?
> >> >
> >> > After forum, I can share the slides, and I hope they will be useful to
> >> > you.
> >> >
> >> > Instead, let me summarize at a glance compared with other file systems.
> >> > Here are several log-structured file systems.
> >> > Note that, F2FS operates on top of block device with consideration on
> >> > the FTL behavior.
> >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed
> >> > for raw NAND flash.
> >> > LogFS is initially designed for raw NAND flash, but expanded to block
> >> > device.
> >> > But, I don't know whether it is stable or not.
> >> > NILFS2 is one of major log-structured file systems, which supports
> >> > multiple snap-shots.
> >> > IMO, that feature is quite promising and important to users, but it may
> >> > degrade the performance.
> >> > There is a trade-off between functionalities and performance.
> >> > F2FS chose high performance without any further fancy functionalities.
> >> >
> >>
> >> Performance is a good goal. But fault-tolerance is also very important
> >> point. Filesystems are used by
> >> users, so, it is very important to guarantee reliability of data keeping.
> >> Degradation of performance
> >> by means of snapshots is arguable point. Snapshots can solve the problem
> >> not only some unpredictable
> >> environmental issues but also user's erroneous behavior.
> >>
> >
> > Yes, I agree. I concerned the multiple snapshot feature.
> > Of course, fault-tolerance is very important, and file system should support
> > it as you know as power-off-recovery.
> > f2fs supports the recovery mechanism by adopting checkpoint similar to
> > snapshot.
> > But, f2fs does not support multiple snapshots for user convenience.
> > I just focused on the performance, and absolutely, the multiple snapshot
> > feature is also a good alternative approach.
> > That may be a trade-off.
> >
> >> As I understand, it is not possible to have a perfect performance in all
> >> possible workloads. Could you
> >> point out what workloads are the best way of F2FS using?
> >
> > Basically I think the following workloads will be good for F2FS.
> > - Many random writes : it's LFS nature
> > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync
> > overhead.
> >
> >>
> >> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash
> >> > storages.
> >> > IMHO, however, they are originally designed for HDDs, so that it may or
> >> > may not suffer from
> >> fundamental designs.
> >> > I don't know, but why not designing a new file system for flash storages
> >> > as a counterpart?
> >> >
> >>
> >> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2,
> >> YAFFS2, UBIFS but block-
> >> oriented filesystem. So, F2FS design is restricted by block-layer's
> >> opportunities in the using of
> >> flash storages' peculiarities. Could you point out key points of F2FS
> >> design that makes this design
> >> fundamentally unique?
> >
> > As you can see the f2fs kernel document patch, I think one of the most
> > important features is to align operating units between f2fs and ftl.
> > Specifically, f2fs has section and zone, which are cleaning unit and basic
> > allocation unit respectively.
> > Through these configurable units in f2fs, I think f2fs is able to reduce the
> > unnecessary operations done by FTL.
> > And, in order to avoid changing IO patterns by the block-layer, f2fs merges
> > itself some bios likewise ext4.
> Hello.
> The internal of eMMC and SSD is the blackbox from user side.
> How does the normal user easily set operating units alignment(page
> size and physical block size ?) between f2fs and ftl in storage device
> ?

I know of some work that has tried to figure out these units by profiling the storage, AKA reverse engineering.
In most cases, the simplest way is to measure the latencies of consecutive writes and analyze their patterns (a rough sketch follows below).
As you mentioned, in practice users will not want to do this, so maybe we need a tool that profiles the device in order to optimize f2fs.
In the current state, I think profiling is a separate issue, and mkfs.f2fs had better include this work in the future.
But, IMO, from the viewpoint of performance, the default configuration is quite enough for now.

ps) f2fs doesn't care about the flash page size; it considers the garbage collection unit.
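
A crude sketch of such a latency probe, assuming GNU coreutils dd and a scratch partition with no valuable data (the device name /dev/sdX1 is hypothetical, and this overwrites its contents), might be:

  # time 1MB direct writes at consecutive 1MB offsets; latency jumps often
  # hint at the boundaries of the FTL's internal allocation/GC unit
  for i in $(seq 0 63); do
      dd if=/dev/zero of=/dev/sdX1 bs=1M count=1 seek=$i oflag=direct 2>&1 | grep copied
  done

A real profiling tool would of course need to repeat this over larger ranges and request sizes and do proper statistics; this only illustrates the measure-consecutive-write-latencies idea mentioned above.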

> 
> Thanks.
> 
> >
> >>
> >> With the best regards,
> >> Vyacheslav Dubeyko.
> >>
> >>
> >> >>
> >> >> Marco
> >> >
> >> > ---
> >> > Jaegeuk Kim
> >> > Samsung
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> >> > in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> >
> > ---
> > Jaegeuk Kim
> > Samsung
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >


---
Jaegeuk Kim
Samsung



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-08 10:52               ` Jaegeuk Kim
@ 2012-10-08 11:21                 ` Namjae Jeon
  2012-10-08 12:11                   ` Jaegeuk Kim
  2012-10-09  8:31                 ` Lukáš Czerner
  1 sibling, 1 reply; 154+ messages in thread
From: Namjae Jeon @ 2012-10-08 11:21 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Vyacheslav Dubeyko, Marco Stornelli, Jaegeuk Kim, Al Viro, tytso,
	gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
>> -----Original Message-----
>> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
>> Sent: Monday, October 08, 2012 7:00 PM
>> To: Jaegeuk Kim
>> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro;
>> tytso@mit.edu;
>> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>> chur.lee@samsung.com; cm224.lee@samsung.com;
>> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>>
>> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
>> >> -----Original Message-----
>> >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
>> >> Sent: Sunday, October 07, 2012 9:09 PM
>> >> To: Jaegeuk Kim
>> >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
>> >> gregkh@linuxfoundation.org; linux-
>> >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
>> >> jooyoung.hwang@samsung.com;
>> >> linux-fsdevel@vger.kernel.org
>> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>> >>
>> >> Hi,
>> >>
>> >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
>> >>
>> >> >> -----Original Message-----
>> >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
>> >> >> Sent: Sunday, October 07, 2012 4:10 PM
>> >> >> To: Jaegeuk Kim
>> >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
>> >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
>> >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
>> >> >> cm224.lee@samsung.com;
>> >> jooyoung.hwang@samsung.com;
>> >> >> linux-fsdevel@vger.kernel.org
>> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file
>> >> >> system
>> >> >>
>> >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
>> >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
>> >> >>>> Hi Jaegeuk,
>> >> >>>
>> >> >>> Hi.
>> >> >>> We know each other, right? :)
>> >> >>>
>> >> >>>>
>> >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
>> >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
>> >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
>> >> >> chur.lee@samsung.com,
>> >> cm224.lee@samsung.com,
>> >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
>> >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file
>> >> >>>>> system
>> >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
>> >> >>>>>
>> >> >>>>> This is a new patch set for the f2fs file system.
>> >> >>>>>
>> >> >>>>> What is F2FS?
>> >> >>>>> =============
>> >> >>>>>
>> >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and
>> >> >>>>> SD
>> >> >>>>> cards, have
>> >> >>>>> been widely being used for ranging from mobile to server
>> >> >>>>> systems.
>> >> >>>>> Since they are
>> >> >>>>> known to have different characteristics from the conventional
>> >> >>>>> rotational disks,
>> >> >>>>> a file system, an upper layer to the storage device, should adapt
>> >> >>>>> to
>> >> >>>>> the changes
>> >> >>>>> from the sketch.
>> >> >>>>>
>> >> >>>>> F2FS is a new file system carefully designed for the NAND flash
>> >> >>>>> memory-based storage
>> >> >>>>> devices. We chose a log structure file system approach, but we
>> >> >>>>> tried
>> >> >>>>> to adapt it
>> >> >>>>> to the new form of storage. Also we remedy some known issues of
>> >> >>>>> the
>> >> >>>>> very old log
>> >> >>>>> structured file system, such as snowball effect of wandering
>> >> >>>>> tree
>> >> >>>>> and high cleaning
>> >> >>>>> overhead.
>> >> >>>>>
>> >> >>>>> Because a NAND-based storage device shows different
>> >> >>>>> characteristics
>> >> >>>>> according to
>> >> >>>>> its internal geometry or flash memory management scheme aka FTL,
>> >> >>>>> we
>> >> >>>>> add various
>> >> >>>>> parameters not only for configuring on-disk layout, but also for
>> >> >>>>> selecting allocation
>> >> >>>>> and cleaning algorithms.
>> >> >>>>>
>> >> >>>>
>> >> >>>> What about F2FS performance? Could you share benchmarking results
>> >> >>>> of
>> >> >>>> the new file system?
>> >> >>>>
>> >> >>>> It is very interesting the case of aged file system. How is GC's
>> >> >>>> implementation efficient? Could
>> >> >> you share benchmarking results for the very aged file system state?
>> >> >>>>
>> >> >>>
>> >> >>> Although I have benchmark results, currently I'd like to see the
>> >> >>> results
>> >> >>> measured by community as a black-box. As you know, the results are
>> >> >>> very
>> >> >>> dependent on the workloads and parameters, so I think it would be
>> >> >>> better
>> >> >>> to see other results for a while.
>> >> >>> Thanks,
>> >> >>>
>> >> >>
>> >> >> 1) Actually it's a strange approach. If you have got any results
>> >> >> you
>> >> >> should share them with the community explaining how (the workload,
>> >> >> hw
>> >> >> and so on) your benchmark works and the specific condition. I
>> >> >> really
>> >> >> don't like the approach "I've got the results but I don't say
>> >> >> anything,
>> >> >> if you want a number, do it yourself".
>> >> >
>> >> > It's definitely right, and I meant *for a while*.
>> >> > I just wanted to avoid arguing with how to age file system in this
>> >> > time.
>> >> > Before then, I share the primitive results as follows.
>> >> >
>> >> > 1. iozone in Panda board
>> >> > - ARM A9
>> >> > - DRAM : 1GB
>> >> > - Kernel: Linux 3.3
>> >> > - Partition: 12GB (64GB Samsung eMMC)
>> >> > - Tested on 2GB file
>> >> >
>> >> >           seq. read, seq. write, rand. read, rand. write
>> >> > - ext4:    30.753         17.066       5.06         4.15
>> >> > - f2fs:    30.71          16.906       5.073       15.204
>> >> >
>> >> > 2. iozone in Galaxy Nexus
>> >> > - DRAM : 1GB
>> >> > - Android 4.0.4_r1.2
>> >> > - Kernel omap 3.0.8
>> >> > - Partition: /data, 12GB
>> >> > - Tested on 2GB file
>> >> >
>> >> >           seq. read, seq. write, rand. read,  rand. write
>> >> > - ext4:    29.88        12.83         11.43          0.56
>> >> > - f2fs:    29.70        13.34         10.79         12.82
>> >> >
>> >>
>> >>
>> >> This is results for non-aged filesystem state. Am I correct?
>> >>
>> >
>> > Yes, right.
>> >
>> >>
>> >> > Due to the company secret, I expect to show other results after
>> >> > presenting f2fs at korea linux forum.
>> >> >
>> >> >> 2) For a new filesystem you should send the patches to
>> >> >> linux-fsdevel.
>> >> >
>> >> > Yes, that was totally my mistake.
>> >> >
>> >> >> 3) It's not clear the pros/cons of your filesystem, can you share
>> >> >> with
>> >> >> us the main differences with the current fs already in mainline? Or
>> >> >> is
>> >> >> it a company secret?
>> >> >
>> >> > After forum, I can share the slides, and I hope they will be useful
>> >> > to
>> >> > you.
>> >> >
>> >> > Instead, let me summarize at a glance compared with other file
>> >> > systems.
>> >> > Here are several log-structured file systems.
>> >> > Note that, F2FS operates on top of block device with consideration
>> >> > on
>> >> > the FTL behavior.
>> >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are
>> >> > designed
>> >> > for raw NAND flash.
>> >> > LogFS is initially designed for raw NAND flash, but expanded to
>> >> > block
>> >> > device.
>> >> > But, I don't know whether it is stable or not.
>> >> > NILFS2 is one of major log-structured file systems, which supports
>> >> > multiple snap-shots.
>> >> > IMO, that feature is quite promising and important to users, but it
>> >> > may
>> >> > degrade the performance.
>> >> > There is a trade-off between functionalities and performance.
>> >> > F2FS chose high performance without any further fancy
>> >> > functionalities.
>> >> >
>> >>
>> >> Performance is a good goal. But fault-tolerance is also very important
>> >> point. Filesystems are used by
>> >> users, so, it is very important to guarantee reliability of data
>> >> keeping.
>> >> Degradation of performance
>> >> by means of snapshots is arguable point. Snapshots can solve the
>> >> problem
>> >> not only some unpredictable
>> >> environmental issues but also user's erroneous behavior.
>> >>
>> >
>> > Yes, I agree. I concerned the multiple snapshot feature.
>> > Of course, fault-tolerance is very important, and file system should
>> > support
>> > it as you know as power-off-recovery.
>> > f2fs supports the recovery mechanism by adopting checkpoint similar to
>> > snapshot.
>> > But, f2fs does not support multiple snapshots for user convenience.
>> > I just focused on the performance, and absolutely, the multiple
>> > snapshot
>> > feature is also a good alternative approach.
>> > That may be a trade-off.
>> >
>> >> As I understand, it is not possible to have a perfect performance in
>> >> all
>> >> possible workloads. Could you
>> >> point out what workloads are the best way of F2FS using?
>> >
>> > Basically I think the following workloads will be good for F2FS.
>> > - Many random writes : it's LFS nature
>> > - Small writes with frequent fsync : f2fs is optimized to reduce the
>> > fsync
>> > overhead.
>> >
>> >>
>> >> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash
>> >> > storages.
>> >> > IMHO, however, they are originally designed for HDDs, so that it may
>> >> > or
>> >> > may not suffer from
>> >> fundamental designs.
>> >> > I don't know, but why not designing a new file system for flash
>> >> > storages
>> >> > as a counterpart?
>> >> >
>> >>
>> >> Yes, it is possible. But F2FS is not flash oriented filesystem as
>> >> JFFS2,
>> >> YAFFS2, UBIFS but block-
>> >> oriented filesystem. So, F2FS design is restricted by block-layer's
>> >> opportunities in the using of
>> >> flash storages' peculiarities. Could you point out key points of F2FS
>> >> design that makes this design
>> >> fundamentally unique?
>> >
>> > As you can see the f2fs kernel document patch, I think one of the most
>> > important features is to align operating units between f2fs and ftl.
>> > Specifically, f2fs has section and zone, which are cleaning unit and
>> > basic
>> > allocation unit respectively.
>> > Through these configurable units in f2fs, I think f2fs is able to reduce
>> > the
>> > unnecessary operations done by FTL.
>> > And, in order to avoid changing IO patterns by the block-layer, f2fs
>> > merges
>> > itself some bios likewise ext4.
>> Hello.
>> The internal of eMMC and SSD is the blackbox from user side.
>> How does the normal user easily set operating units alignment(page
>> size and physical block size ?) between f2fs and ftl in storage device
>> ?
>
> I've known that some works have been tried to figure out the units by
> profiling the storage, AKA reverse engineering.
> In most cases, the simplest way is to measure the latencies of consecutive
> writes and analyze their patterns.
> As you mentioned, in practical, users will not want to do this, so maybe we
> need a tool to profile them to optimize f2fs.
> In the current state, I think profiling is an another issue, and mkfs.f2fs
> had better include this work in the future.
Well, does the format tool evaluate the optimal block size at every format? As
you know, the capacity of flash-based storage devices is increasing every
year, so the format time could become too long on larger devices (e.g. one
device, one partition).
> But, IMO, from the viewpoint of performance, default configuration is quite
> enough now.
With the default configuration (right after a clean format), would you share
the performance difference between f2fs and the other log-structured file
systems, rather than ext4?

Thanks.
>
> ps) f2fs doesn't care about the flash page size, but considers garbage
> collection unit.
>
>>
>> Thanks.
>>
>> >
>> >>
>> >> With the best regards,
>> >> Vyacheslav Dubeyko.
>> >>
>> >>
>> >> >>
>> >> >> Marco
>> >> >
>> >> > ---
>> >> > Jaegeuk Kim
>> >> > Samsung
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe
>> >> > linux-kernel"
>> >> > in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> > Please read the FAQ at  http://www.tux.org/lkml/
>> >
>> >
>> > ---
>> > Jaegeuk Kim
>> > Samsung
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
>> > in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>
>
> ---
> Jaegeuk Kim
> Samsung
>
>
>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-08 11:21                 ` Namjae Jeon
@ 2012-10-08 12:11                   ` Jaegeuk Kim
  2012-10-09  3:52                     ` Namjae Jeon
  0 siblings, 1 reply; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-08 12:11 UTC (permalink / raw)
  To: 'Namjae Jeon'
  Cc: 'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> -----Original Message-----
> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> Sent: Monday, October 08, 2012 8:22 PM
> To: Jaegeuk Kim
> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; tytso@mit.edu;
> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> >> -----Original Message-----
> >> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> >> Sent: Monday, October 08, 2012 7:00 PM
> >> To: Jaegeuk Kim
> >> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro;
> >> tytso@mit.edu;
> >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> chur.lee@samsung.com; cm224.lee@samsung.com;
> >> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >>
> >> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> >> >> -----Original Message-----
> >> >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> >> >> Sent: Sunday, October 07, 2012 9:09 PM
> >> >> To: Jaegeuk Kim
> >> >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> >> >> gregkh@linuxfoundation.org; linux-
> >> >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> >> >> jooyoung.hwang@samsung.com;
> >> >> linux-fsdevel@vger.kernel.org
> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >> >>
> >> >> Hi,
> >> >>
> >> >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> >> >>
> >> >> >> -----Original Message-----
> >> >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> >> >> >> Sent: Sunday, October 07, 2012 4:10 PM
> >> >> >> To: Jaegeuk Kim
> >> >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
> >> >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
> >> >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> >> >> >> cm224.lee@samsung.com;
> >> >> jooyoung.hwang@samsung.com;
> >> >> >> linux-fsdevel@vger.kernel.org
> >> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file
> >> >> >> system
> >> >> >>
> >> >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> >> >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> >> >> >>>> Hi Jaegeuk,
> >> >> >>>
> >> >> >>> Hi.
> >> >> >>> We know each other, right? :)
> >> >> >>>
> >> >> >>>>
> >> >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> >> >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> >> >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> >> >> >> chur.lee@samsung.com,
> >> >> cm224.lee@samsung.com,
> >> >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> >> >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file
> >> >> >>>>> system
> >> >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> >> >> >>>>>
> >> >> >>>>> This is a new patch set for the f2fs file system.
> >> >> >>>>>
> >> >> >>>>> What is F2FS?
> >> >> >>>>> =============
> >> >> >>>>>
> >> >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and
> >> >> >>>>> SD
> >> >> >>>>> cards, have
> >> >> >>>>> been widely being used for ranging from mobile to server
> >> >> >>>>> systems.
> >> >> >>>>> Since they are
> >> >> >>>>> known to have different characteristics from the conventional
> >> >> >>>>> rotational disks,
> >> >> >>>>> a file system, an upper layer to the storage device, should adapt
> >> >> >>>>> to
> >> >> >>>>> the changes
> >> >> >>>>> from the sketch.
> >> >> >>>>>
> >> >> >>>>> F2FS is a new file system carefully designed for the NAND flash
> >> >> >>>>> memory-based storage
> >> >> >>>>> devices. We chose a log structure file system approach, but we
> >> >> >>>>> tried
> >> >> >>>>> to adapt it
> >> >> >>>>> to the new form of storage. Also we remedy some known issues of
> >> >> >>>>> the
> >> >> >>>>> very old log
> >> >> >>>>> structured file system, such as snowball effect of wandering
> >> >> >>>>> tree
> >> >> >>>>> and high cleaning
> >> >> >>>>> overhead.
> >> >> >>>>>
> >> >> >>>>> Because a NAND-based storage device shows different
> >> >> >>>>> characteristics
> >> >> >>>>> according to
> >> >> >>>>> its internal geometry or flash memory management scheme aka FTL,
> >> >> >>>>> we
> >> >> >>>>> add various
> >> >> >>>>> parameters not only for configuring on-disk layout, but also for
> >> >> >>>>> selecting allocation
> >> >> >>>>> and cleaning algorithms.
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>> What about F2FS performance? Could you share benchmarking results
> >> >> >>>> of
> >> >> >>>> the new file system?
> >> >> >>>>
> >> >> >>>> It is very interesting the case of aged file system. How is GC's
> >> >> >>>> implementation efficient? Could
> >> >> >> you share benchmarking results for the very aged file system state?
> >> >> >>>>
> >> >> >>>
> >> >> >>> Although I have benchmark results, currently I'd like to see the
> >> >> >>> results
> >> >> >>> measured by community as a black-box. As you know, the results are
> >> >> >>> very
> >> >> >>> dependent on the workloads and parameters, so I think it would be
> >> >> >>> better
> >> >> >>> to see other results for a while.
> >> >> >>> Thanks,
> >> >> >>>
> >> >> >>
> >> >> >> 1) Actually it's a strange approach. If you have got any results
> >> >> >> you
> >> >> >> should share them with the community explaining how (the workload,
> >> >> >> hw
> >> >> >> and so on) your benchmark works and the specific condition. I
> >> >> >> really
> >> >> >> don't like the approach "I've got the results but I don't say
> >> >> >> anything,
> >> >> >> if you want a number, do it yourself".
> >> >> >
> >> >> > It's definitely right, and I meant *for a while*.
> >> >> > I just wanted to avoid arguing with how to age file system in this
> >> >> > time.
> >> >> > Before then, I share the primitive results as follows.
> >> >> >
> >> >> > 1. iozone in Panda board
> >> >> > - ARM A9
> >> >> > - DRAM : 1GB
> >> >> > - Kernel: Linux 3.3
> >> >> > - Partition: 12GB (64GB Samsung eMMC)
> >> >> > - Tested on 2GB file
> >> >> >
> >> >> >           seq. read, seq. write, rand. read, rand. write
> >> >> > - ext4:    30.753         17.066       5.06         4.15
> >> >> > - f2fs:    30.71          16.906       5.073       15.204
> >> >> >
> >> >> > 2. iozone in Galaxy Nexus
> >> >> > - DRAM : 1GB
> >> >> > - Android 4.0.4_r1.2
> >> >> > - Kernel omap 3.0.8
> >> >> > - Partition: /data, 12GB
> >> >> > - Tested on 2GB file
> >> >> >
> >> >> >           seq. read, seq. write, rand. read,  rand. write
> >> >> > - ext4:    29.88        12.83         11.43          0.56
> >> >> > - f2fs:    29.70        13.34         10.79         12.82
> >> >> >
> >> >>
> >> >>
> >> >> This is results for non-aged filesystem state. Am I correct?
> >> >>
> >> >
> >> > Yes, right.
> >> >
> >> >>
> >> >> > Due to the company secret, I expect to show other results after
> >> >> > presenting f2fs at korea linux forum.
> >> >> >
> >> >> >> 2) For a new filesystem you should send the patches to
> >> >> >> linux-fsdevel.
> >> >> >
> >> >> > Yes, that was totally my mistake.
> >> >> >
> >> >> >> 3) It's not clear the pros/cons of your filesystem, can you share
> >> >> >> with
> >> >> >> us the main differences with the current fs already in mainline? Or
> >> >> >> is
> >> >> >> it a company secret?
> >> >> >
> >> >> > After forum, I can share the slides, and I hope they will be useful
> >> >> > to
> >> >> > you.
> >> >> >
> >> >> > Instead, let me summarize at a glance compared with other file
> >> >> > systems.
> >> >> > Here are several log-structured file systems.
> >> >> > Note that, F2FS operates on top of block device with consideration
> >> >> > on
> >> >> > the FTL behavior.
> >> >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are
> >> >> > designed
> >> >> > for raw NAND flash.
> >> >> > LogFS is initially designed for raw NAND flash, but expanded to
> >> >> > block
> >> >> > device.
> >> >> > But, I don't know whether it is stable or not.
> >> >> > NILFS2 is one of major log-structured file systems, which supports
> >> >> > multiple snap-shots.
> >> >> > IMO, that feature is quite promising and important to users, but it
> >> >> > may
> >> >> > degrade the performance.
> >> >> > There is a trade-off between functionalities and performance.
> >> >> > F2FS chose high performance without any further fancy
> >> >> > functionalities.
> >> >> >
> >> >>
> >> >> Performance is a good goal. But fault-tolerance is also very important
> >> >> point. Filesystems are used by
> >> >> users, so, it is very important to guarantee reliability of data
> >> >> keeping.
> >> >> Degradation of performance
> >> >> by means of snapshots is arguable point. Snapshots can solve the
> >> >> problem
> >> >> not only some unpredictable
> >> >> environmental issues but also user's erroneous behavior.
> >> >>
> >> >
> >> > Yes, I agree. I concerned the multiple snapshot feature.
> >> > Of course, fault-tolerance is very important, and file system should
> >> > support
> >> > it as you know as power-off-recovery.
> >> > f2fs supports the recovery mechanism by adopting checkpoint similar to
> >> > snapshot.
> >> > But, f2fs does not support multiple snapshots for user convenience.
> >> > I just focused on the performance, and absolutely, the multiple
> >> > snapshot
> >> > feature is also a good alternative approach.
> >> > That may be a trade-off.
> >> >
> >> >> As I understand, it is not possible to have a perfect performance in
> >> >> all
> >> >> possible workloads. Could you
> >> >> point out what workloads are the best way of F2FS using?
> >> >
> >> > Basically I think the following workloads will be good for F2FS.
> >> > - Many random writes : it's LFS nature
> >> > - Small writes with frequent fsync : f2fs is optimized to reduce the
> >> > fsync
> >> > overhead.
> >> >
> >> >>
> >> >> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash
> >> >> > storages.
> >> >> > IMHO, however, they are originally designed for HDDs, so that it may
> >> >> > or
> >> >> > may not suffer from
> >> >> fundamental designs.
> >> >> > I don't know, but why not designing a new file system for flash
> >> >> > storages
> >> >> > as a counterpart?
> >> >> >
> >> >>
> >> >> Yes, it is possible. But F2FS is not flash oriented filesystem as
> >> >> JFFS2,
> >> >> YAFFS2, UBIFS but block-
> >> >> oriented filesystem. So, F2FS design is restricted by block-layer's
> >> >> opportunities in the using of
> >> >> flash storages' peculiarities. Could you point out key points of F2FS
> >> >> design that makes this design
> >> >> fundamentally unique?
> >> >
> >> > As you can see the f2fs kernel document patch, I think one of the most
> >> > important features is to align operating units between f2fs and ftl.
> >> > Specifically, f2fs has section and zone, which are cleaning unit and
> >> > basic
> >> > allocation unit respectively.
> >> > Through these configurable units in f2fs, I think f2fs is able to reduce
> >> > the
> >> > unnecessary operations done by FTL.
> >> > And, in order to avoid changing IO patterns by the block-layer, f2fs
> >> > merges
> >> > itself some bios likewise ext4.
> >> Hello.
> >> The internal of eMMC and SSD is the blackbox from user side.
> >> How does the normal user easily set operating units alignment(page
> >> size and physical block size ?) between f2fs and ftl in storage device
> >> ?
> >
> > I've known that some works have been tried to figure out the units by
> > profiling the storage, AKA reverse engineering.
> > In most cases, the simplest way is to measure the latencies of consecutive
> > writes and analyze their patterns.
> > As you mentioned, in practical, users will not want to do this, so maybe we
> > need a tool to profile them to optimize f2fs.
> > In the current state, I think profiling is an another issue, and mkfs.f2fs
> > had better include this work in the future.
> Well, Format tool evaluates optimal block size whenever formatting? As
> you know, The size of Flash Based storage device is increasing every
> year. It means format time can be too long on larger devices(e.g. one
> device, one parition).

Every file system will suffer from a long format time on such a huge device.
And I don't think the profiling time would scale up with the device size, since it's unnecessary to scan the whole device.
Once the unit size has been identified, we can simply stop.
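
To make the "measure the latencies of consecutive writes" idea concrete, a bounded probe could look like the sketch below. This is purely illustrative, not mkfs.f2fs code; the write size, probe length, and the assumption that a periodic latency spike marks the FTL allocation/GC unit boundary are all guesses for the example, and the probe is destructive to the target device.

/* ftl_probe.c - illustrative sketch: time a bounded run of sequential
 * O_DIRECT+O_SYNC writes; a latency spike recurring every N writes would
 * suggest an FTL unit of roughly N * BUF_SZ. Only a small window of the
 * device is touched, so the probe does not scale with device capacity.
 * Build: gcc -O2 -o ftl_probe ftl_probe.c
 * Run (DESTRUCTIVE): ./ftl_probe /dev/sdX > latencies.txt
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUF_SZ    (512 * 1024)   /* assumed probe write size */
#define NR_WRITES 512            /* bounded window, not the whole device */

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <block device>\n", argv[0]);
                return 1;
        }
        int fd = open(argv[1], O_WRONLY | O_DIRECT | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, BUF_SZ)) { perror("posix_memalign"); return 1; }
        memset(buf, 0xa5, BUF_SZ);

        for (int i = 0; i < NR_WRITES; i++) {
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (pwrite(fd, buf, BUF_SZ, (off_t)i * BUF_SZ) != BUF_SZ) {
                        perror("pwrite");
                        break;
                }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                          (t1.tv_nsec - t0.tv_nsec) / 1000L;
                printf("%d %ld\n", i, us);   /* post-process: look for a fixed period */
        }
        free(buf);
        close(fd);
        return 0;
}

Finding a stable period in the output is a separate offline step; the point is only that the probe window stays fixed no matter how large the device is.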

> > But, IMO, from the viewpoint of performance, default configuration is quite
> > enough now.
> At default(after cleanly format), Would you share performance
> difference between other log structured filesystems in comparison to
> f2fs instead of ext4 ?
> 

Actually, we've focused on ext4, so I have no results for other file systems measured on embedded systems.
I'll test them sooner or later and report the results.
Thank you for the valuable comments.

> Thanks.
> >
> > ps) f2fs doesn't care about the flash page size, but considers garbage
> > collection unit.
> >
> >>
> >> Thanks.
> >>
> >> >
> >> >>
> >> >> With the best regards,
> >> >> Vyacheslav Dubeyko.
> >> >>
> >> >>
> >> >> >>
> >> >> >> Marco
> >> >> >
> >> >> > ---
> >> >> > Jaegeuk Kim
> >> >> > Samsung
> >> >> >
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe
> >> >> > linux-kernel"
> >> >> > in
> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >> >
> >> >
> >> > ---
> >> > Jaegeuk Kim
> >> > Samsung
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
> >> > in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >
> >
> >
> > ---
> > Jaegeuk Kim
> > Samsung
> >
> >
> >


---
Jaegeuk Kim
Samsung


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-08  8:25             ` Jaegeuk Kim
  (?)
  (?)
@ 2012-10-08 19:22             ` Vyacheslav Dubeyko
  2012-10-09  7:08                 ` Jaegeuk Kim
  -1 siblings, 1 reply; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-08 19:22 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

Hi,

On Oct 8, 2012, at 12:25 PM, Jaegeuk Kim wrote:

>> -----Original Message-----
>> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
>> Sent: Sunday, October 07, 2012 9:09 PM
>> To: Jaegeuk Kim
>> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
>> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
>> linux-fsdevel@vger.kernel.org
>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>> 
>> Hi,
>> 
>> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
>> 
>>>> -----Original Message-----
>>>> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
>>>> Sent: Sunday, October 07, 2012 4:10 PM
>>>> To: Jaegeuk Kim
>>>> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tytso@mit.edu; gregkh@linuxfoundation.org;
>>>> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
>> jooyoung.hwang@samsung.com;
>>>> linux-fsdevel@vger.kernel.org
>>>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>>>> 
>>>> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
>>>>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
>>>>>> Hi Jaegeuk,
>>>>> 
>>>>> Hi.
>>>>> We know each other, right? :)
>>>>> 
>>>>>> 
>>>>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
>>>>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
>>>> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com,
>> cm224.lee@samsung.com,
>>>> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
>>>>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
>>>>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
>>>>>>> 
>>>>>>> This is a new patch set for the f2fs file system.
>>>>>>> 
>>>>>>> What is F2FS?
>>>>>>> =============
>>>>>>> 
>>>>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
>>>>>>> been widely being used for ranging from mobile to server systems. Since they are
>>>>>>> known to have different characteristics from the conventional rotational disks,
>>>>>>> a file system, an upper layer to the storage device, should adapt to the changes
>>>>>>> from the sketch.
>>>>>>> 
>>>>>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
>>>>>>> devices. We chose a log structure file system approach, but we tried to adapt it
>>>>>>> to the new form of storage. Also we remedy some known issues of the very old log
>>>>>>> structured file system, such as snowball effect of wandering tree and high cleaning
>>>>>>> overhead.
>>>>>>> 
>>>>>>> Because a NAND-based storage device shows different characteristics according to
>>>>>>> its internal geometry or flash memory management scheme aka FTL, we add various
>>>>>>> parameters not only for configuring on-disk layout, but also for selecting allocation
>>>>>>> and cleaning algorithms.
>>>>>>> 
>>>>>> 
>>>>>> What about F2FS performance? Could you share benchmarking results of the new file system?
>>>>>> 
>>>>>> It is very interesting the case of aged file system. How is GC's implementation efficient? Could
>>>> you share benchmarking results for the very aged file system state?
>>>>>> 
>>>>> 
>>>>> Although I have benchmark results, currently I'd like to see the results
>>>>> measured by community as a black-box. As you know, the results are very
>>>>> dependent on the workloads and parameters, so I think it would be better
>>>>> to see other results for a while.
>>>>> Thanks,
>>>>> 
>>>> 
>>>> 1) Actually it's a strange approach. If you have got any results you
>>>> should share them with the community explaining how (the workload, hw
>>>> and so on) your benchmark works and the specific condition. I really
>>>> don't like the approach "I've got the results but I don't say anything,
>>>> if you want a number, do it yourself".
>>> 
>>> It's definitely right, and I meant *for a while*.
>>> I just wanted to avoid arguing with how to age file system in this time.
>>> Before then, I share the primitive results as follows.
>>> 
>>> 1. iozone in Panda board
>>> - ARM A9
>>> - DRAM : 1GB
>>> - Kernel: Linux 3.3
>>> - Partition: 12GB (64GB Samsung eMMC)
>>> - Tested on 2GB file
>>> 
>>>          seq. read, seq. write, rand. read, rand. write
>>> - ext4:    30.753         17.066       5.06         4.15
>>> - f2fs:    30.71          16.906       5.073       15.204
>>> 
>>> 2. iozone in Galaxy Nexus
>>> - DRAM : 1GB
>>> - Android 4.0.4_r1.2
>>> - Kernel omap 3.0.8
>>> - Partition: /data, 12GB
>>> - Tested on 2GB file
>>> 
>>>          seq. read, seq. write, rand. read,  rand. write
>>> - ext4:    29.88        12.83         11.43          0.56
>>> - f2fs:    29.70        13.34         10.79         12.82
>>> 
>> 
>> 
>> This is results for non-aged filesystem state. Am I correct?
>> 
> 
> Yes, right.
> 
>> 
>>> Due to the company secret, I expect to show other results after presenting f2fs at korea linux forum.
>>> 
>>>> 2) For a new filesystem you should send the patches to linux-fsdevel.
>>> 
>>> Yes, that was totally my mistake.
>>> 
>>>> 3) It's not clear the pros/cons of your filesystem, can you share with
>>>> us the main differences with the current fs already in mainline? Or is
>>>> it a company secret?
>>> 
>>> After forum, I can share the slides, and I hope they will be useful to you.
>>> 
>>> Instead, let me summarize at a glance compared with other file systems.
>>> Here are several log-structured file systems.
>>> Note that, F2FS operates on top of block device with consideration on the FTL behavior.
>>> So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed for raw NAND flash.
>>> LogFS is initially designed for raw NAND flash, but expanded to block device.
>>> But, I don't know whether it is stable or not.
>>> NILFS2 is one of major log-structured file systems, which supports multiple snap-shots.
>>> IMO, that feature is quite promising and important to users, but it may degrade the performance.
>>> There is a trade-off between functionalities and performance.
>>> F2FS chose high performance without any further fancy functionalities.
>>> 
>> 
>> Performance is a good goal. But fault-tolerance is also very important point. Filesystems are used by
>> users, so, it is very important to guarantee reliability of data keeping. Degradation of performance
>> by means of snapshots is arguable point. Snapshots can solve the problem not only some unpredictable
>> environmental issues but also user's erroneous behavior.
>> 
> 
> Yes, I agree. I concerned the multiple snapshot feature.
> Of course, fault-tolerance is very important, and file system should support it as you know as power-off-recovery.
> f2fs supports the recovery mechanism by adopting checkpoint similar to snapshot.
> But, f2fs does not support multiple snapshots for user convenience.
> I just focused on the performance, and absolutely, the multiple snapshot feature is also a good alternative approach.
> That may be a trade-off.

So, maybe I misunderstand something, but I can't see the difference. As I understand it, a snapshot in NILFS2 is simply a checkpoint that the user has converted into a snapshot. A NILFS2 checkpoint is a log entry that records a new file system state (user data + metadata); in other words, checkpointing is the basic write mechanism on the volume. Moreover, NILFS2 gives a flexible way of managing checkpoints and snapshots.

As you say, f2fs supports checkpoints as well, which suggests to me that checkpoints are the basic write mechanism in f2fs too. So what performance gain and what difference are you talking about?

Moreover, as far as I understand, the user cannot manage f2fs checkpoints at all. It is not clear which points can serve as starting points for recovery. How is it possible to define how many checkpoints an f2fs volume will have?

How much user data (and metadata) can be lost in the case of a sudden power-off? Is it possible to estimate this?

> 
>> As I understand, it is not possible to have a perfect performance in all possible workloads. Could you
>> point out what workloads are the best way of F2FS using?
> 
> Basically I think the following workloads will be good for F2FS.
> - Many random writes : it's LFS nature
> - Small writes with frequent fsync : f2fs is optimized to reduce the fsync overhead.
> 

Yes, that may be true for a non-aged f2fs volume. But I am afraid that for an aged f2fs volume the situation can be the opposite: in the aged state, the GC will be working hard under the above-mentioned workloads.

But, as I understand it, smartphones and tablets are the most promising targets for f2fs, because f2fs is designed for NAND flash memory-based storage devices. So I think that workloads such as "many random writes" or "small writes with frequent fsync" are not the most frequent use cases there. Creating and deleting many small files can be a more common use case on smartphones and tablets. But, as far as I can see, f2fs has a slightly expensive metadata payload for small-file creation. Moreover, frequent and random deletion of small files results in very complicated and unpredictable GC behavior, as far as I understand.

>> 
>>> Maybe or obviously it is possible to optimize ext4 or btrfs to flash storages.
>>> IMHO, however, they are originally designed for HDDs, so that it may or may not suffer from
>> fundamental designs.
>>> I don't know, but why not designing a new file system for flash storages as a counterpart?
>>> 
>> 
>> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2, YAFFS2, UBIFS but block-
>> oriented filesystem. So, F2FS design is restricted by block-layer's opportunities in the using of
>> flash storages' peculiarities. Could you point out key points of F2FS design that makes this design
>> fundamentally unique?
> 
> As you can see the f2fs kernel document patch, I think one of the most important features is to align operating units between f2fs and ftl.
> Specifically, f2fs has section and zone, which are cleaning unit and basic allocation unit respectively.
> Through these configurable units in f2fs, I think f2fs is able to reduce the unnecessary operations done by FTL.
> And, in order to avoid changing IO patterns by the block-layer, f2fs merges itself some bios likewise ext4.
> 

As I understand it, it is not so easy to create a partition for an f2fs volume that is aligned on the operating units (especially in the case of eMMC or SSD). The performance of an unaligned volume can degrade significantly because of FTL activity. What mechanisms does f2fs have for avoiding such a situation and achieving the goal of reducing unnecessary FTL operations?
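
For concreteness, my understanding is that the alignment in question reduces to simple arithmetic over the units. Here is a rough sketch assuming a 2 MB f2fs segment and a hypothetical 4 MB FTL GC unit; the FTL number and the partition offset are made up for illustration, this is not f2fs code:

/* Illustrative alignment check: pick segments-per-section so that an f2fs
 * section (the cleaning unit) matches the FTL's GC unit, and verify the
 * partition itself starts on that boundary.
 */
#include <stdio.h>

#define F2FS_SEG_BYTES   (2UL << 20)   /* f2fs segment: 512 x 4 KB blocks */
#define FTL_UNIT_BYTES   (4UL << 20)   /* hypothetical FTL GC/erase unit  */
#define PART_START_BYTES (4UL << 20)   /* assumed partition start offset  */

int main(void)
{
        if (FTL_UNIT_BYTES % F2FS_SEG_BYTES)
                printf("FTL unit is not a multiple of the segment size; "
                       "sections cannot match it exactly\n");
        else
                printf("use %lu segment(s) per section so the cleaning unit "
                       "matches the FTL unit\n",
                       FTL_UNIT_BYTES / F2FS_SEG_BYTES);

        if (PART_START_BYTES % FTL_UNIT_BYTES)
                printf("partition start is NOT aligned: every section would "
                       "straddle two FTL units\n");
        else
                printf("partition start is aligned to the FTL unit\n");
        return 0;
}

If the partition offset fails the second check, no choice of section/zone size can stop each cleaning unit from spanning two FTL units, which is presumably where the degradation I mention above comes from.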

With the best regards,
Vyacheslav Dubeyko.

>> 
>> With the best regards,
>> Vyacheslav Dubeyko.
>> 
>> 
>>>> 
>>>> Marco
>>> 
>>> ---
>>> Jaegeuk Kim
>>> Samsung
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 
> ---
> Jaegeuk Kim
> Samsung
> 


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-08 12:11                   ` Jaegeuk Kim
@ 2012-10-09  3:52                     ` Namjae Jeon
  2012-10-09  8:00                       ` Jaegeuk Kim
  0 siblings, 1 reply; 154+ messages in thread
From: Namjae Jeon @ 2012-10-09  3:52 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Vyacheslav Dubeyko, Marco Stornelli, Jaegeuk Kim, Al Viro, tytso,
	gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
>> -----Original Message-----
>> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
>> Sent: Monday, October 08, 2012 8:22 PM
>> To: Jaegeuk Kim
>> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro;
>> tytso@mit.edu;
>> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>> chur.lee@samsung.com; cm224.lee@samsung.com;
>> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>>
>> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
>> >> -----Original Message-----
>> >> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
>> >> Sent: Monday, October 08, 2012 7:00 PM
>> >> To: Jaegeuk Kim
>> >> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro;
>> >> tytso@mit.edu;
>> >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
>> >> chur.lee@samsung.com; cm224.lee@samsung.com;
>> >> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
>> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
>> >>
>> >> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
>> >> >> -----Original Message-----
>> >> >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
>> >> >> Sent: Sunday, October 07, 2012 9:09 PM
>> >> >> To: Jaegeuk Kim
>> >> >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
>> >> >> gregkh@linuxfoundation.org; linux-
>> >> >> kernel@vger.kernel.org; chur.lee@samsung.com;
>> >> >> cm224.lee@samsung.com;
>> >> >> jooyoung.hwang@samsung.com;
>> >> >> linux-fsdevel@vger.kernel.org
>> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file
>> >> >> system
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
>> >> >>
>> >> >> >> -----Original Message-----
>> >> >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
>> >> >> >> Sent: Sunday, October 07, 2012 4:10 PM
>> >> >> >> To: Jaegeuk Kim
>> >> >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
>> >> >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
>> >> >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
>> >> >> >> cm224.lee@samsung.com;
>> >> >> jooyoung.hwang@samsung.com;
>> >> >> >> linux-fsdevel@vger.kernel.org
>> >> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file
>> >> >> >> system
>> >> >> >>
>> >> >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
>> >> >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
>> >> >> >>>> Hi Jaegeuk,
>> >> >> >>>
>> >> >> >>> Hi.
>> >> >> >>> We know each other, right? :)
>> >> >> >>>
>> >> >> >>>>
>> >> >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
>> >> >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o'
>> >> >> >>>>> <tytso@mit.edu>,
>> >> >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
>> >> >> >> chur.lee@samsung.com,
>> >> >> cm224.lee@samsung.com,
>> >> >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
>> >> >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file
>> >> >> >>>>> system
>> >> >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
>> >> >> >>>>>
>> >> >> >>>>> This is a new patch set for the f2fs file system.
>> >> >> >>>>>
>> >> >> >>>>> What is F2FS?
>> >> >> >>>>> =============
>> >> >> >>>>>
>> >> >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC,
>> >> >> >>>>> and
>> >> >> >>>>> SD
>> >> >> >>>>> cards, have
>> >> >> >>>>> been widely being used for ranging from mobile to server
>> >> >> >>>>> systems.
>> >> >> >>>>> Since they are
>> >> >> >>>>> known to have different characteristics from the conventional
>> >> >> >>>>> rotational disks,
>> >> >> >>>>> a file system, an upper layer to the storage device, should
>> >> >> >>>>> adapt
>> >> >> >>>>> to
>> >> >> >>>>> the changes
>> >> >> >>>>> from the sketch.
>> >> >> >>>>>
>> >> >> >>>>> F2FS is a new file system carefully designed for the NAND
>> >> >> >>>>> flash
>> >> >> >>>>> memory-based storage
>> >> >> >>>>> devices. We chose a log structure file system approach, but
>> >> >> >>>>> we
>> >> >> >>>>> tried
>> >> >> >>>>> to adapt it
>> >> >> >>>>> to the new form of storage. Also we remedy some known issues
>> >> >> >>>>> of
>> >> >> >>>>> the
>> >> >> >>>>> very old log
>> >> >> >>>>> structured file system, such as snowball effect of wandering
>> >> >> >>>>> tree
>> >> >> >>>>> and high cleaning
>> >> >> >>>>> overhead.
>> >> >> >>>>>
>> >> >> >>>>> Because a NAND-based storage device shows different
>> >> >> >>>>> characteristics
>> >> >> >>>>> according to
>> >> >> >>>>> its internal geometry or flash memory management scheme aka
>> >> >> >>>>> FTL,
>> >> >> >>>>> we
>> >> >> >>>>> add various
>> >> >> >>>>> parameters not only for configuring on-disk layout, but also
>> >> >> >>>>> for
>> >> >> >>>>> selecting allocation
>> >> >> >>>>> and cleaning algorithms.
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>> What about F2FS performance? Could you share benchmarking
>> >> >> >>>> results
>> >> >> >>>> of
>> >> >> >>>> the new file system?
>> >> >> >>>>
>> >> >> >>>> It is very interesting the case of aged file system. How is
>> >> >> >>>> GC's
>> >> >> >>>> implementation efficient? Could
>> >> >> >> you share benchmarking results for the very aged file system
>> >> >> >> state?
>> >> >> >>>>
>> >> >> >>>
>> >> >> >>> Although I have benchmark results, currently I'd like to see
>> >> >> >>> the
>> >> >> >>> results
>> >> >> >>> measured by community as a black-box. As you know, the results
>> >> >> >>> are
>> >> >> >>> very
>> >> >> >>> dependent on the workloads and parameters, so I think it would
>> >> >> >>> be
>> >> >> >>> better
>> >> >> >>> to see other results for a while.
>> >> >> >>> Thanks,
>> >> >> >>>
>> >> >> >>
>> >> >> >> 1) Actually it's a strange approach. If you have got any results
>> >> >> >> you
>> >> >> >> should share them with the community explaining how (the
>> >> >> >> workload,
>> >> >> >> hw
>> >> >> >> and so on) your benchmark works and the specific condition. I
>> >> >> >> really
>> >> >> >> don't like the approach "I've got the results but I don't say
>> >> >> >> anything,
>> >> >> >> if you want a number, do it yourself".
>> >> >> >
>> >> >> > It's definitely right, and I meant *for a while*.
>> >> >> > I just wanted to avoid arguing with how to age file system in
>> >> >> > this
>> >> >> > time.
>> >> >> > Before then, I share the primitive results as follows.
>> >> >> >
>> >> >> > 1. iozone in Panda board
>> >> >> > - ARM A9
>> >> >> > - DRAM : 1GB
>> >> >> > - Kernel: Linux 3.3
>> >> >> > - Partition: 12GB (64GB Samsung eMMC)
>> >> >> > - Tested on 2GB file
>> >> >> >
>> >> >> >           seq. read, seq. write, rand. read, rand. write
>> >> >> > - ext4:    30.753         17.066       5.06         4.15
>> >> >> > - f2fs:    30.71          16.906       5.073       15.204
>> >> >> >
>> >> >> > 2. iozone in Galaxy Nexus
>> >> >> > - DRAM : 1GB
>> >> >> > - Android 4.0.4_r1.2
>> >> >> > - Kernel omap 3.0.8
>> >> >> > - Partition: /data, 12GB
>> >> >> > - Tested on 2GB file
>> >> >> >
>> >> >> >           seq. read, seq. write, rand. read,  rand. write
>> >> >> > - ext4:    29.88        12.83         11.43          0.56
>> >> >> > - f2fs:    29.70        13.34         10.79         12.82
>> >> >> >
>> >> >>
>> >> >>
>> >> >> This is results for non-aged filesystem state. Am I correct?
>> >> >>
>> >> >
>> >> > Yes, right.
>> >> >
>> >> >>
>> >> >> > Due to the company secret, I expect to show other results after
>> >> >> > presenting f2fs at korea linux forum.
>> >> >> >
>> >> >> >> 2) For a new filesystem you should send the patches to
>> >> >> >> linux-fsdevel.
>> >> >> >
>> >> >> > Yes, that was totally my mistake.
>> >> >> >
>> >> >> >> 3) It's not clear the pros/cons of your filesystem, can you
>> >> >> >> share
>> >> >> >> with
>> >> >> >> us the main differences with the current fs already in mainline?
>> >> >> >> Or
>> >> >> >> is
>> >> >> >> it a company secret?
>> >> >> >
>> >> >> > After forum, I can share the slides, and I hope they will be
>> >> >> > useful
>> >> >> > to
>> >> >> > you.
>> >> >> >
>> >> >> > Instead, let me summarize at a glance compared with other file
>> >> >> > systems.
>> >> >> > Here are several log-structured file systems.
>> >> >> > Note that, F2FS operates on top of block device with
>> >> >> > consideration
>> >> >> > on
>> >> >> > the FTL behavior.
>> >> >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are
>> >> >> > designed
>> >> >> > for raw NAND flash.
>> >> >> > LogFS is initially designed for raw NAND flash, but expanded to
>> >> >> > block
>> >> >> > device.
>> >> >> > But, I don't know whether it is stable or not.
>> >> >> > NILFS2 is one of major log-structured file systems, which
>> >> >> > supports
>> >> >> > multiple snap-shots.
>> >> >> > IMO, that feature is quite promising and important to users, but
>> >> >> > it
>> >> >> > may
>> >> >> > degrade the performance.
>> >> >> > There is a trade-off between functionalities and performance.
>> >> >> > F2FS chose high performance without any further fancy
>> >> >> > functionalities.
>> >> >> >
>> >> >>
>> >> >> Performance is a good goal. But fault-tolerance is also very
>> >> >> important
>> >> >> point. Filesystems are used by
>> >> >> users, so, it is very important to guarantee reliability of data
>> >> >> keeping.
>> >> >> Degradation of performance
>> >> >> by means of snapshots is arguable point. Snapshots can solve the
>> >> >> problem
>> >> >> not only some unpredictable
>> >> >> environmental issues but also user's erroneous behavior.
>> >> >>
>> >> >
>> >> > Yes, I agree. I concerned the multiple snapshot feature.
>> >> > Of course, fault-tolerance is very important, and file system should
>> >> > support
>> >> > it as you know as power-off-recovery.
>> >> > f2fs supports the recovery mechanism by adopting checkpoint similar
>> >> > to
>> >> > snapshot.
>> >> > But, f2fs does not support multiple snapshots for user convenience.
>> >> > I just focused on the performance, and absolutely, the multiple
>> >> > snapshot
>> >> > feature is also a good alternative approach.
>> >> > That may be a trade-off.
>> >> >
>> >> >> As I understand, it is not possible to have a perfect performance
>> >> >> in
>> >> >> all
>> >> >> possible workloads. Could you
>> >> >> point out what workloads are the best way of F2FS using?
>> >> >
>> >> > Basically I think the following workloads will be good for F2FS.
>> >> > - Many random writes : it's LFS nature
>> >> > - Small writes with frequent fsync : f2fs is optimized to reduce the
>> >> > fsync
>> >> > overhead.
>> >> >
>> >> >>
>> >> >> > Maybe or obviously it is possible to optimize ext4 or btrfs to
>> >> >> > flash
>> >> >> > storages.
>> >> >> > IMHO, however, they are originally designed for HDDs, so that it
>> >> >> > may
>> >> >> > or
>> >> >> > may not suffer from
>> >> >> fundamental designs.
>> >> >> > I don't know, but why not designing a new file system for flash
>> >> >> > storages
>> >> >> > as a counterpart?
>> >> >> >
>> >> >>
>> >> >> Yes, it is possible. But F2FS is not flash oriented filesystem as
>> >> >> JFFS2,
>> >> >> YAFFS2, UBIFS but block-
>> >> >> oriented filesystem. So, F2FS design is restricted by block-layer's
>> >> >> opportunities in the using of
>> >> >> flash storages' peculiarities. Could you point out key points of
>> >> >> F2FS
>> >> >> design that makes this design
>> >> >> fundamentally unique?
>> >> >
>> >> > As you can see the f2fs kernel document patch, I think one of the
>> >> > most
>> >> > important features is to align operating units between f2fs and ftl.
>> >> > Specifically, f2fs has section and zone, which are cleaning unit and
>> >> > basic
>> >> > allocation unit respectively.
>> >> > Through these configurable units in f2fs, I think f2fs is able to
>> >> > reduce
>> >> > the
>> >> > unnecessary operations done by FTL.
>> >> > And, in order to avoid changing IO patterns by the block-layer, f2fs
>> >> > merges
>> >> > itself some bios likewise ext4.
>> >> Hello.
>> >> The internal of eMMC and SSD is the blackbox from user side.
>> >> How does the normal user easily set operating units alignment(page
>> >> size and physical block size ?) between f2fs and ftl in storage device
>> >> ?
>> >
>> > I've known that some works have been tried to figure out the units by
>> > profiling the storage, AKA reverse engineering.
>> > In most cases, the simplest way is to measure the latencies of
>> > consecutive
>> > writes and analyze their patterns.
>> > As you mentioned, in practical, users will not want to do this, so maybe
>> > we
>> > need a tool to profile them to optimize f2fs.
>> > In the current state, I think profiling is an another issue, and
>> > mkfs.f2fs
>> > had better include this work in the future.
>> Well, Format tool evaluates optimal block size whenever formatting? As
>> you know, The size of Flash Based storage device is increasing every
>> year. It means format time can be too long on larger devices(e.g. one
>> device, one parition).
>
> Every file systems will suffer from the long format time in such a huge
> device.
> And, I don't think the profiling time would not be scaled up, since it's
> unnecessary to scan whole device.
> After getting the size, we just can stop it.
The key point is that you would have to estimate the correct optimal block size
of the FTL with very little I/O at format time.
I am not sure that is possible.
And you would have to verify that the estimated block size is really correct on
several devices from each vendor.

>
>> > But, IMO, from the viewpoint of performance, default configuration is
>> > quite
>> > enough now.
>> At default(after cleanly format), Would you share performance
>> difference between other log structured filesystems in comparison to
>> f2fs instead of ext4 ?
>>
>
> Actually, we've focused on ext4, so I have no results of other file systems
> measured on embedded systems.
> I'll test sooner or later, and report them.
Okay, Thanks Jaegeuk.

> Thank you for valuable comments.
>
>> Thanks.
>> >
>> > ps) f2fs doesn't care about the flash page size, but considers garbage
>> > collection unit.
>> >
>> >>
>> >> Thanks.
>> >>
>> >> >
>> >> >>
>> >> >> With the best regards,
>> >> >> Vyacheslav Dubeyko.
>> >> >>
>> >> >>
>> >> >> >>
>> >> >> >> Marco
>> >> >> >
>> >> >> > ---
>> >> >> > Jaegeuk Kim
>> >> >> > Samsung
>> >> >> >
>> >> >> > --
>> >> >> > To unsubscribe from this list: send the line "unsubscribe
>> >> >> > linux-kernel"
>> >> >> > in
>> >> >> > the body of a message to majordomo@vger.kernel.org
>> >> >> > More majordomo info at
>> >> >> > http://vger.kernel.org/majordomo-info.html
>> >> >> > Please read the FAQ at  http://www.tux.org/lkml/
>> >> >
>> >> >
>> >> > ---
>> >> > Jaegeuk Kim
>> >> > Samsung
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe
>> >> > linux-fsdevel"
>> >> > in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >
>> >
>> >
>> > ---
>> > Jaegeuk Kim
>> > Samsung
>> >
>> >
>> >
>
>
> ---
> Jaegeuk Kim
> Samsung
>
>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-08 19:22             ` Vyacheslav Dubeyko
@ 2012-10-09  7:08                 ` Jaegeuk Kim
  0 siblings, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-09  7:08 UTC (permalink / raw)
  To: 'Vyacheslav Dubeyko'
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> -----Original Message-----
> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> Sent: Tuesday, October 09, 2012 4:23 AM
> To: Jaegeuk Kim
> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> Hi,
> 
> On Oct 8, 2012, at 12:25 PM, Jaegeuk Kim wrote:
> 
> >> -----Original Message-----
> >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> >> Sent: Sunday, October 07, 2012 9:09 PM
> >> To: Jaegeuk Kim
> >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> >> linux-fsdevel@vger.kernel.org
> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >>
> >> Hi,
> >>
> >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> >>
> >>>> -----Original Message-----
> >>>> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> >>>> Sent: Sunday, October 07, 2012 4:10 PM
> >>>> To: Jaegeuk Kim
> >>>> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tytso@mit.edu;
> gregkh@linuxfoundation.org;
> >>>> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> >> jooyoung.hwang@samsung.com;
> >>>> linux-fsdevel@vger.kernel.org
> >>>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >>>>
> >>>> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> >>>>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> >>>>>> Hi Jaegeuk,
> >>>>>
> >>>>> Hi.
> >>>>> We know each other, right? :)
> >>>>>
> >>>>>>
> >>>>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> >>>>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> >>>> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com,
> >> cm224.lee@samsung.com,
> >>>> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> >>>>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> >>>>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> >>>>>>>
> >>>>>>> This is a new patch set for the f2fs file system.
> >>>>>>>
> >>>>>>> What is F2FS?
> >>>>>>> =============
> >>>>>>>
> >>>>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> >>>>>>> been widely being used for ranging from mobile to server systems. Since they are
> >>>>>>> known to have different characteristics from the conventional rotational disks,
> >>>>>>> a file system, an upper layer to the storage device, should adapt to the changes
> >>>>>>> from the sketch.
> >>>>>>>
> >>>>>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
> >>>>>>> devices. We chose a log structure file system approach, but we tried to adapt it
> >>>>>>> to the new form of storage. Also we remedy some known issues of the very old log
> >>>>>>> structured file system, such as snowball effect of wandering tree and high cleaning
> >>>>>>> overhead.
> >>>>>>>
> >>>>>>> Because a NAND-based storage device shows different characteristics according to
> >>>>>>> its internal geometry or flash memory management scheme aka FTL, we add various
> >>>>>>> parameters not only for configuring on-disk layout, but also for selecting allocation
> >>>>>>> and cleaning algorithms.
> >>>>>>>
> >>>>>>
> >>>>>> What about F2FS performance? Could you share benchmarking results of the new file system?
> >>>>>>
> >>>>>> It is very interesting the case of aged file system. How is GC's implementation efficient?
> Could
> >>>> you share benchmarking results for the very aged file system state?
> >>>>>>
> >>>>>
> >>>>> Although I have benchmark results, currently I'd like to see the results
> >>>>> measured by community as a black-box. As you know, the results are very
> >>>>> dependent on the workloads and parameters, so I think it would be better
> >>>>> to see other results for a while.
> >>>>> Thanks,
> >>>>>
> >>>>
> >>>> 1) Actually it's a strange approach. If you have got any results you
> >>>> should share them with the community explaining how (the workload, hw
> >>>> and so on) your benchmark works and the specific condition. I really
> >>>> don't like the approach "I've got the results but I don't say anything,
> >>>> if you want a number, do it yourself".
> >>>
> >>> It's definitely right, and I meant *for a while*.
> >>> I just wanted to avoid arguing with how to age file system in this time.
> >>> Before then, I share the primitive results as follows.
> >>>
> >>> 1. iozone in Panda board
> >>> - ARM A9
> >>> - DRAM : 1GB
> >>> - Kernel: Linux 3.3
> >>> - Partition: 12GB (64GB Samsung eMMC)
> >>> - Tested on 2GB file
> >>>
> >>>          seq. read, seq. write, rand. read, rand. write
> >>> - ext4:    30.753         17.066       5.06         4.15
> >>> - f2fs:    30.71          16.906       5.073       15.204
> >>>
> >>> 2. iozone in Galaxy Nexus
> >>> - DRAM : 1GB
> >>> - Android 4.0.4_r1.2
> >>> - Kernel omap 3.0.8
> >>> - Partition: /data, 12GB
> >>> - Tested on 2GB file
> >>>
> >>>          seq. read, seq. write, rand. read,  rand. write
> >>> - ext4:    29.88        12.83         11.43          0.56
> >>> - f2fs:    29.70        13.34         10.79         12.82
> >>>
> >>
> >>
> >> This is results for non-aged filesystem state. Am I correct?
> >>
> >
> > Yes, right.
> >
> >>
> >>> Due to the company secret, I expect to show other results after presenting f2fs at korea linux
> forum.
> >>>
> >>>> 2) For a new filesystem you should send the patches to linux-fsdevel.
> >>>
> >>> Yes, that was totally my mistake.
> >>>
> >>>> 3) It's not clear the pros/cons of your filesystem, can you share with
> >>>> us the main differences with the current fs already in mainline? Or is
> >>>> it a company secret?
> >>>
> >>> After forum, I can share the slides, and I hope they will be useful to you.
> >>>
> >>> Instead, let me summarize at a glance compared with other file systems.
> >>> Here are several log-structured file systems.
> >>> Note that, F2FS operates on top of block device with consideration on the FTL behavior.
> >>> So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed for raw NAND flash.
> >>> LogFS is initially designed for raw NAND flash, but expanded to block device.
> >>> But, I don't know whether it is stable or not.
> >>> NILFS2 is one of major log-structured file systems, which supports multiple snap-shots.
> >>> IMO, that feature is quite promising and important to users, but it may degrade the performance.
> >>> There is a trade-off between functionalities and performance.
> >>> F2FS chose high performance without any further fancy functionalities.
> >>>
> >>
> >> Performance is a good goal. But fault-tolerance is also very important point. Filesystems are used
> by
> >> users, so, it is very important to guarantee reliability of data keeping. Degradation of
> performance
> >> by means of snapshots is arguable point. Snapshots can solve the problem not only some
> unpredictable
> >> environmental issues but also user's erroneous behavior.
> >>
> >
> > Yes, I agree. I concerned the multiple snapshot feature.
> > Of course, fault-tolerance is very important, and file system should support it as you know as
> power-off-recovery.
> > f2fs supports the recovery mechanism by adopting checkpoint similar to snapshot.
> > But, f2fs does not support multiple snapshots for user convenience.
> > I just focused on the performance, and absolutely, the multiple snapshot feature is also a good
> alternative approach.
> > That may be a trade-off.
> 
> So, maybe I misunderstand something, but I can't understand the difference. As I know, snapshot in
> NILFS2 is a checkpoint converted by user in snapshot. So, NILFS2's checkpoint is a log that adds new
> file system's state changing (user data + metadata). In other words, checkpoint is mechanism of
> writing on volume. Moreover, NILFS2 gives flexible way of checkpoint/snapshot management.
> 
> As you are saying, f2fs supports checkpoints also. It means for me that checkpoints are the basic
> mechanism of writing operations on f2fs. But, about what performance gain and difference do you talk?

How about the following scenario?
1. data "a" is newly written.
2. checkpoint "A" is done.
3. data "a" is truncated.
4. checkpoint "B" is done.

If the fs exposes multiple snapshots such as "A" and "B" to users, it cannot reuse the space occupied by
data "a" after checkpoint "B", even though the truncation of "a" has been made stable by checkpoint "B".
This is because the fs must keep data "a" around to allow a roll-back to "A".
So, even though the user sees some free space, the LFS may suffer from cleaning because the truly free segments are exhausted.
If users want to avoid this, they have to remove snapshots by themselves, or maybe the fs would have to expire them automatically.
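
To make the space accounting concrete, here is a tiny user-space sketch (illustrative only, not f2fs code;
the segment layout and counts are made up) of how retained snapshots pin segments that a two-checkpoint
scheme could already reclaim:

#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative sketch only (not f2fs code): each "segment" records the
 * snapshot/checkpoint version that still references it.  With multiple
 * retained snapshots, a segment is reclaimable only when no retained
 * snapshot references it; with two alternating checkpoints, anything
 * older than the last stable checkpoint can be reclaimed.
 */
#define NR_SEGS 8

struct seg { int referenced_by; bool valid; };

static int free_segments(struct seg *s, const int *retained, int nr_retained)
{
    int free = 0;
    for (int i = 0; i < NR_SEGS; i++) {
        bool pinned = false;
        for (int j = 0; j < nr_retained; j++)
            if (s[i].valid && s[i].referenced_by == retained[j])
                pinned = true;
        if (!pinned)
            free++;
    }
    return free;
}

int main(void)
{
    /* data "a" was written under checkpoint A (version 1) and truncated
     * before checkpoint B (version 2); its segments matter only to A.  */
    struct seg segs[NR_SEGS] = {
        { 1, true }, { 1, true }, { 1, true },   /* blocks of data "a"       */
        { 2, true },                             /* live data after truncate */
        { 0, false }, { 0, false }, { 0, false }, { 0, false },
    };

    int snapshots[]   = { 1, 2 };   /* user keeps snapshot A and B          */
    int checkpoints[] = { 2 };      /* only the last stable checkpoint      */

    printf("free segments with snapshots A+B retained:    %d\n",
           free_segments(segs, snapshots, 2));
    printf("free segments with two alternating checkpoints: %d\n",
           free_segments(segs, checkpoints, 1));
    return 0;
}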

> 
> Moreover, user can't manage by f2fs checkpoints completely, as I can understand. It is not so clear
> what critical points can be a starting points of recovery actions. How is it possible to define how
> many checkpoints f2fs volume will have?

IMHO, the user should not need to know how many snapshots exist or track the fs utilization all the time.
(Off-list: I don't see why the cleaning process should have to be tuned by users.)

f2fs writes two checkpoints alternately: one holds the last stable checkpoint and the other is used for the next checkpoint.
So, during recovery, f2fs starts by locating the latest stable one of the two.
A stable checkpoint must contain the whole set of index structures and data consistently.
As you know, many of these ideas can be found in the following LFS paper.
http://www.cs.berkeley.edu/~brewer/cs262/LFS.pdf
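
As a rough sketch of that selection step (the real f2fs on-disk checkpoint layout, version fields and
checksum differ; this only shows the idea), recovery can compare the two checkpoint packs and mount from
the newest one whose checksum verifies:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-ins for the two checkpoint packs; the real f2fs
 * structures and CRC algorithm are different - this only shows the
 * selection logic.                                                  */
struct ckpt {
    uint64_t version;    /* monotonically increasing checkpoint version */
    uint32_t checksum;   /* covers the whole checkpoint pack            */
    uint32_t computed;   /* what we recomputed while reading it back    */
};

static bool ckpt_valid(const struct ckpt *c)
{
    return c->checksum == c->computed;   /* pack was written completely */
}

/* Return the newer of the two valid checkpoints, or NULL if none is usable. */
static const struct ckpt *pick_stable(const struct ckpt *a, const struct ckpt *b)
{
    bool va = ckpt_valid(a), vb = ckpt_valid(b);

    if (va && vb)
        return a->version > b->version ? a : b;
    if (va)
        return a;
    if (vb)
        return b;
    return NULL;
}

int main(void)
{
    /* pack #1 is older but intact; pack #2 was interrupted by power-off */
    struct ckpt cp1 = { .version = 41, .checksum = 0xabcd, .computed = 0xabcd };
    struct ckpt cp2 = { .version = 42, .checksum = 0x1234, .computed = 0xdead };

    const struct ckpt *stable = pick_stable(&cp1, &cp2);
    if (stable)
        printf("mount from checkpoint version %llu\n",
               (unsigned long long)stable->version);
    else
        printf("no valid checkpoint: cannot mount\n");
    return 0;
}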


> 
> How many user data (metadata) can be lost in the case of sudden power off? Is it possible to estimate
> this?
> 

If the user calls sync, f2fs writes out all the dirty data via the VFS and then writes a checkpoint.
In that case, all the data are safe.
Suppose that after the sync several fsyncs are issued and then a sudden power-off occurs.
In that case, f2fs first rolls back to the last stable checkpoint of the two, and then rolls forward to recover only the fsync'ed data.
So, f2fs guarantees only the data covered by sync or fsync.
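
A minimal sketch of that recovery order, assuming a simplified node log where fsync-written blocks carry a
mark (these are not the actual f2fs structures):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative node-log entry: in this sketch a node block written by
 * fsync carries a flag and the checkpoint version it follows.        */
struct node_entry {
    uint64_t cp_version;   /* checkpoint version in effect when written */
    bool     fsync_mark;   /* set by fsync-triggered writes only        */
    int      ino;
};

static void recover(uint64_t stable_cp,
                    const struct node_entry *log, int n)
{
    /* Step 1: roll back - only the state at 'stable_cp' is trusted.   */
    printf("roll back to checkpoint %llu\n", (unsigned long long)stable_cp);

    /* Step 2: roll forward - replay fsync'ed nodes written after it.  */
    for (int i = 0; i < n; i++) {
        if (log[i].cp_version >= stable_cp && log[i].fsync_mark)
            printf("replay fsync'ed inode %d\n", log[i].ino);
        /* everything else written after the checkpoint is discarded   */
    }
}

int main(void)
{
    const struct node_entry log[] = {
        { 7, true,  100 },   /* fsync'ed after checkpoint 7: recovered  */
        { 7, false, 101 },   /* ordinary write after checkpoint: lost   */
        { 7, true,  102 },   /* fsync'ed: recovered                     */
    };

    recover(7, log, 3);
    return 0;
}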

> >
> >> As I understand, it is not possible to have a perfect performance in all possible workloads. Could
> you
> >> point out what workloads are the best way of F2FS using?
> >
> > Basically I think the following workloads will be good for F2FS.
> > - Many random writes : it's LFS nature
> > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync overhead.
> >
> 
> Yes, it can be so for the case of non-aged f2fs volume. But I am afraid that for the case of aged f2fs
> volume the situation can be opposite. I think that in the case of aged state of f2fs volume the GC
> will be under hard work in above-mentioned workloads.

Yes, you're right.
In the LFS paper above, there are two logging schemes: threaded logging and copy-and-compaction.
In order to avoid high cleaning overhead, f2fs adopts a hybrid scheme which switches the allocation policy dynamically between the two.
Threaded logging is similar to the traditional approach: it reuses invalid blocks in place, resulting in random writes but no cleaning operations.
Copy-and-compaction is another name for cleaning: it results in sequential writes at the cost of cleaning operations.
So, f2fs picks one of them at runtime according to the file system status.
Through this, we could see random write performance comparable to ext4 even in the worst case.
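
Something like the following toy policy switch captures the idea; the watermark and numbers are invented
for illustration and are not the real f2fs heuristics:

#include <stdio.h>

/* Illustrative policy switch only - the real f2fs heuristics and
 * thresholds differ.  With plenty of free sections, append logging
 * (copy-and-compaction style) keeps writes sequential; when free
 * space runs low, threaded logging reuses invalid blocks in dirty
 * segments so that no cleaning is needed up front.                 */
enum alloc_policy { APPEND_LOGGING, THREADED_LOGGING };

static enum alloc_policy choose_policy(int free_sections, int total_sections)
{
    const int low_watermark_pct = 5;            /* made-up threshold          */

    if (free_sections * 100 < total_sections * low_watermark_pct)
        return THREADED_LOGGING;                /* random writes, no cleaning */
    return APPEND_LOGGING;                      /* sequential writes          */
}

int main(void)
{
    printf("fresh volume : %s\n",
           choose_policy(900, 1000) == APPEND_LOGGING ? "append logging"
                                                      : "threaded logging");
    printf("aged volume  : %s\n",
           choose_policy(30, 1000) == APPEND_LOGGING ? "append logging"
                                                     : "threaded logging");
    return 0;
}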

> 
> But, as I can understand, smartphones and tablets are the most promising way of f2fs using. Because
> f2fs designs for NAND flash memory based-storage devices. So, I think that such workloads as "many
> random writes" or "small writes with frequent fsync" are not so frequent use-cases. Use-case of
> creation and deletion many small files can be more frequent use-case under smartphones and tablets.
> But, as I can understand, f2fs has slightly expensive metadata payload in the case of small files
> creation. Moreover, frequent and random deletion of small files ends in the very sophisticated and
> unpredictable GC behavior, as I can understand.
> 

I'd like to share the following paper.
http://research.cs.wisc.edu/adsl/Publications/ibench-tocs12.pdf

In our experiments on Android phones as well, we've seen many random write patterns with frequent fsync calls.
We found that the main source is the database, and I think f2fs is beneficial for that workload.
As you mentioned, I agree that it is important to handle many small files too.
It is true that this may cause additional cleaning overhead, and f2fs does carry some metadata payload overhead.
In order to reduce the cleaning overhead, f2fs adopts static and dynamic hot/cold data separation.
The main goal is to separate data according to their type (e.g., dir inode, file inode, dentry data, etc.) as much as possible.
Please see the document for the details.
I think this approach is quite effective in achieving that goal.
BTW, the payload overhead could be addressed by embedding small data in the inode, as ext4 does.
I think that is also a good idea, and I hope to adopt it in the future.
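
A rough sketch of the static part of that separation, with a simplified type-to-log mapping (the real
classification in the document has more cases and is also adjusted dynamically):

#include <stdio.h>

/* Simplified illustration of static hot/cold separation: data are
 * steered into separate logs by type so that blocks with similar
 * lifetimes end up in the same segments.  The mapping below is a
 * rough sketch, not the exact f2fs classification.                */
enum log_type { HOT_NODE, WARM_NODE, COLD_NODE, HOT_DATA, WARM_DATA, COLD_DATA };

enum block_kind { DIR_INODE, FILE_INODE, INDIRECT_NODE,
                  DENTRY_BLOCK, FILE_DATA, MULTIMEDIA_DATA };

static enum log_type classify(enum block_kind kind)
{
    switch (kind) {
    case DIR_INODE:       return HOT_NODE;    /* directory nodes change often    */
    case FILE_INODE:      return WARM_NODE;
    case INDIRECT_NODE:   return COLD_NODE;   /* indirect node blocks are stable */
    case DENTRY_BLOCK:    return HOT_DATA;    /* dentry blocks are hot data      */
    case MULTIMEDIA_DATA: return COLD_DATA;   /* write-once, rarely updated      */
    case FILE_DATA:
    default:              return WARM_DATA;
    }
}

static const char *name[] = { "hot node", "warm node", "cold node",
                              "hot data", "warm data", "cold data" };

int main(void)
{
    printf("dir inode       -> %s\n", name[classify(DIR_INODE)]);
    printf("dentry block    -> %s\n", name[classify(DENTRY_BLOCK)]);
    printf("file data       -> %s\n", name[classify(FILE_DATA)]);
    printf("multimedia data -> %s\n", name[classify(MULTIMEDIA_DATA)]);
    return 0;
}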

> >>
> >>> Maybe or obviously it is possible to optimize ext4 or btrfs to flash storages.
> >>> IMHO, however, they are originally designed for HDDs, so that it may or may not suffer from
> >> fundamental designs.
> >>> I don't know, but why not designing a new file system for flash storages as a counterpart?
> >>>
> >>
> >> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2, YAFFS2, UBIFS but block-
> >> oriented filesystem. So, F2FS design is restricted by block-layer's opportunities in the using of
> >> flash storages' peculiarities. Could you point out key points of F2FS design that makes this design
> >> fundamentally unique?
> >
> > As you can see the f2fs kernel document patch, I think one of the most important features is to
> align operating units between f2fs and ftl.
> > Specifically, f2fs has section and zone, which are cleaning unit and basic allocation unit
> respectively.
> > Through these configurable units in f2fs, I think f2fs is able to reduce the unnecessary operations
> done by FTL.
> > And, in order to avoid changing IO patterns by the block-layer, f2fs merges itself some bios
> likewise ext4.
> >
> 
> As I can understand, it is not so easy to create partition with f2fs volume which is aligned on
> operating units (especially in the case of eMMC or SSD).

Could you explain why it is not so easy?

> Performance of unaligned volume can degrade
> significantly because of FTL activity. What mechanisms has f2fs for excluding such situation and
> achieving of the goal to reduce unnecessary FTL operations?

Could you please explain your concern more precisely?
As described in the kernel doc, the start address of each f2fs data structure is aligned to the segment size (i.e., 2MB).
Do you mean that, or the other operating units (e.g., section and zone)?
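
For illustration, the alignment itself is just round-up arithmetic; the metadata offset below is an
arbitrary example, not the real layout computed by mkfs.f2fs:

#include <stdio.h>
#include <stdint.h>

/* Sketch of the alignment arithmetic only: round a byte offset up to
 * the next segment boundary.  2MB segments are the f2fs default; the
 * "metadata" offset below is just an example value.                  */
#define SEGMENT_BYTES   (2ULL * 1024 * 1024)

static uint64_t align_up(uint64_t off, uint64_t unit)
{
    return (off + unit - 1) / unit * unit;
}

int main(void)
{
    uint64_t metadata_end = 5ULL * 1024 * 1024 + 4096;   /* example offset */
    uint64_t main_start   = align_up(metadata_end, SEGMENT_BYTES);

    printf("metadata ends at    %llu\n", (unsigned long long)metadata_end);
    printf("main area starts at %llu (segment aligned)\n",
           (unsigned long long)main_start);
    return 0;
}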

Thanks,

> 
> With the best regards,
> Vyacheslav Dubeyko.
> 
> >>
> >> With the best regards,
> >> Vyacheslav Dubeyko.
> >>
> >>
> >>>>
> >>>> Marco
> >>>
> >>> ---
> >>> Jaegeuk Kim
> >>> Samsung
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> Please read the FAQ at  http://www.tux.org/lkml/
> >
> >
> > ---
> > Jaegeuk Kim
> > Samsung
> >


---
Jaegeuk Kim
Samsung


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09  3:52                     ` Namjae Jeon
@ 2012-10-09  8:00                       ` Jaegeuk Kim
  0 siblings, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-09  8:00 UTC (permalink / raw)
  To: 'Namjae Jeon'
  Cc: 'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel



---
Jaegeuk Kim
Samsung


> -----Original Message-----
> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> Sent: Tuesday, October 09, 2012 12:52 PM
> To: Jaegeuk Kim
> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; tytso@mit.edu;
> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> >> -----Original Message-----
> >> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> >> Sent: Monday, October 08, 2012 8:22 PM
> >> To: Jaegeuk Kim
> >> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro;
> >> tytso@mit.edu;
> >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> chur.lee@samsung.com; cm224.lee@samsung.com;
> >> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >>
> >> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> >> >> -----Original Message-----
> >> >> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> >> >> Sent: Monday, October 08, 2012 7:00 PM
> >> >> To: Jaegeuk Kim
> >> >> Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro;
> >> >> tytso@mit.edu;
> >> >> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> >> >> chur.lee@samsung.com; cm224.lee@samsung.com;
> >> >> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >> >>
> >> >> 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> >> >> >> -----Original Message-----
> >> >> >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> >> >> >> Sent: Sunday, October 07, 2012 9:09 PM
> >> >> >> To: Jaegeuk Kim
> >> >> >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> >> >> >> gregkh@linuxfoundation.org; linux-
> >> >> >> kernel@vger.kernel.org; chur.lee@samsung.com;
> >> >> >> cm224.lee@samsung.com;
> >> >> >> jooyoung.hwang@samsung.com;
> >> >> >> linux-fsdevel@vger.kernel.org
> >> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file
> >> >> >> system
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> >> >> >>
> >> >> >> >> -----Original Message-----
> >> >> >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> >> >> >> >> Sent: Sunday, October 07, 2012 4:10 PM
> >> >> >> >> To: Jaegeuk Kim
> >> >> >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
> >> >> >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
> >> >> >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> >> >> >> >> cm224.lee@samsung.com;
> >> >> >> jooyoung.hwang@samsung.com;
> >> >> >> >> linux-fsdevel@vger.kernel.org
> >> >> >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file
> >> >> >> >> system
> >> >> >> >>
> >> >> >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> >> >> >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> >> >> >> >>>> Hi Jaegeuk,
> >> >> >> >>>
> >> >> >> >>> Hi.
> >> >> >> >>> We know each other, right? :)
> >> >> >> >>>
> >> >> >> >>>>
> >> >> >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> >> >> >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o'
> >> >> >> >>>>> <tytso@mit.edu>,
> >> >> >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> >> >> >> >> chur.lee@samsung.com,
> >> >> >> cm224.lee@samsung.com,
> >> >> >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> >> >> >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file
> >> >> >> >>>>> system
> >> >> >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> >> >> >> >>>>>
> >> >> >> >>>>> This is a new patch set for the f2fs file system.
> >> >> >> >>>>>
> >> >> >> >>>>> What is F2FS?
> >> >> >> >>>>> =============
> >> >> >> >>>>>
> >> >> >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC,
> >> >> >> >>>>> and
> >> >> >> >>>>> SD
> >> >> >> >>>>> cards, have
> >> >> >> >>>>> been widely being used for ranging from mobile to server
> >> >> >> >>>>> systems.
> >> >> >> >>>>> Since they are
> >> >> >> >>>>> known to have different characteristics from the conventional
> >> >> >> >>>>> rotational disks,
> >> >> >> >>>>> a file system, an upper layer to the storage device, should
> >> >> >> >>>>> adapt
> >> >> >> >>>>> to
> >> >> >> >>>>> the changes
> >> >> >> >>>>> from the sketch.
> >> >> >> >>>>>
> >> >> >> >>>>> F2FS is a new file system carefully designed for the NAND
> >> >> >> >>>>> flash
> >> >> >> >>>>> memory-based storage
> >> >> >> >>>>> devices. We chose a log structure file system approach, but
> >> >> >> >>>>> we
> >> >> >> >>>>> tried
> >> >> >> >>>>> to adapt it
> >> >> >> >>>>> to the new form of storage. Also we remedy some known issues
> >> >> >> >>>>> of
> >> >> >> >>>>> the
> >> >> >> >>>>> very old log
> >> >> >> >>>>> structured file system, such as snowball effect of wandering
> >> >> >> >>>>> tree
> >> >> >> >>>>> and high cleaning
> >> >> >> >>>>> overhead.
> >> >> >> >>>>>
> >> >> >> >>>>> Because a NAND-based storage device shows different
> >> >> >> >>>>> characteristics
> >> >> >> >>>>> according to
> >> >> >> >>>>> its internal geometry or flash memory management scheme aka
> >> >> >> >>>>> FTL,
> >> >> >> >>>>> we
> >> >> >> >>>>> add various
> >> >> >> >>>>> parameters not only for configuring on-disk layout, but also
> >> >> >> >>>>> for
> >> >> >> >>>>> selecting allocation
> >> >> >> >>>>> and cleaning algorithms.
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>> What about F2FS performance? Could you share benchmarking
> >> >> >> >>>> results
> >> >> >> >>>> of
> >> >> >> >>>> the new file system?
> >> >> >> >>>>
> >> >> >> >>>> It is very interesting the case of aged file system. How is
> >> >> >> >>>> GC's
> >> >> >> >>>> implementation efficient? Could
> >> >> >> >> you share benchmarking results for the very aged file system
> >> >> >> >> state?
> >> >> >> >>>>
> >> >> >> >>>
> >> >> >> >>> Although I have benchmark results, currently I'd like to see
> >> >> >> >>> the
> >> >> >> >>> results
> >> >> >> >>> measured by community as a black-box. As you know, the results
> >> >> >> >>> are
> >> >> >> >>> very
> >> >> >> >>> dependent on the workloads and parameters, so I think it would
> >> >> >> >>> be
> >> >> >> >>> better
> >> >> >> >>> to see other results for a while.
> >> >> >> >>> Thanks,
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >> 1) Actually it's a strange approach. If you have got any results
> >> >> >> >> you
> >> >> >> >> should share them with the community explaining how (the
> >> >> >> >> workload,
> >> >> >> >> hw
> >> >> >> >> and so on) your benchmark works and the specific condition. I
> >> >> >> >> really
> >> >> >> >> don't like the approach "I've got the results but I don't say
> >> >> >> >> anything,
> >> >> >> >> if you want a number, do it yourself".
> >> >> >> >
> >> >> >> > It's definitely right, and I meant *for a while*.
> >> >> >> > I just wanted to avoid arguing with how to age file system in
> >> >> >> > this
> >> >> >> > time.
> >> >> >> > Before then, I share the primitive results as follows.
> >> >> >> >
> >> >> >> > 1. iozone in Panda board
> >> >> >> > - ARM A9
> >> >> >> > - DRAM : 1GB
> >> >> >> > - Kernel: Linux 3.3
> >> >> >> > - Partition: 12GB (64GB Samsung eMMC)
> >> >> >> > - Tested on 2GB file
> >> >> >> >
> >> >> >> >           seq. read, seq. write, rand. read, rand. write
> >> >> >> > - ext4:    30.753         17.066       5.06         4.15
> >> >> >> > - f2fs:    30.71          16.906       5.073       15.204
> >> >> >> >
> >> >> >> > 2. iozone in Galaxy Nexus
> >> >> >> > - DRAM : 1GB
> >> >> >> > - Android 4.0.4_r1.2
> >> >> >> > - Kernel omap 3.0.8
> >> >> >> > - Partition: /data, 12GB
> >> >> >> > - Tested on 2GB file
> >> >> >> >
> >> >> >> >           seq. read, seq. write, rand. read,  rand. write
> >> >> >> > - ext4:    29.88        12.83         11.43          0.56
> >> >> >> > - f2fs:    29.70        13.34         10.79         12.82
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> This is results for non-aged filesystem state. Am I correct?
> >> >> >>
> >> >> >
> >> >> > Yes, right.
> >> >> >
> >> >> >>
> >> >> >> > Due to the company secret, I expect to show other results after
> >> >> >> > presenting f2fs at korea linux forum.
> >> >> >> >
> >> >> >> >> 2) For a new filesystem you should send the patches to
> >> >> >> >> linux-fsdevel.
> >> >> >> >
> >> >> >> > Yes, that was totally my mistake.
> >> >> >> >
> >> >> >> >> 3) It's not clear the pros/cons of your filesystem, can you
> >> >> >> >> share
> >> >> >> >> with
> >> >> >> >> us the main differences with the current fs already in mainline?
> >> >> >> >> Or
> >> >> >> >> is
> >> >> >> >> it a company secret?
> >> >> >> >
> >> >> >> > After forum, I can share the slides, and I hope they will be
> >> >> >> > useful
> >> >> >> > to
> >> >> >> > you.
> >> >> >> >
> >> >> >> > Instead, let me summarize at a glance compared with other file
> >> >> >> > systems.
> >> >> >> > Here are several log-structured file systems.
> >> >> >> > Note that, F2FS operates on top of block device with
> >> >> >> > consideration
> >> >> >> > on
> >> >> >> > the FTL behavior.
> >> >> >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are
> >> >> >> > designed
> >> >> >> > for raw NAND flash.
> >> >> >> > LogFS is initially designed for raw NAND flash, but expanded to
> >> >> >> > block
> >> >> >> > device.
> >> >> >> > But, I don't know whether it is stable or not.
> >> >> >> > NILFS2 is one of major log-structured file systems, which
> >> >> >> > supports
> >> >> >> > multiple snap-shots.
> >> >> >> > IMO, that feature is quite promising and important to users, but
> >> >> >> > it
> >> >> >> > may
> >> >> >> > degrade the performance.
> >> >> >> > There is a trade-off between functionalities and performance.
> >> >> >> > F2FS chose high performance without any further fancy
> >> >> >> > functionalities.
> >> >> >> >
> >> >> >>
> >> >> >> Performance is a good goal. But fault-tolerance is also very
> >> >> >> important
> >> >> >> point. Filesystems are used by
> >> >> >> users, so, it is very important to guarantee reliability of data
> >> >> >> keeping.
> >> >> >> Degradation of performance
> >> >> >> by means of snapshots is arguable point. Snapshots can solve the
> >> >> >> problem
> >> >> >> not only some unpredictable
> >> >> >> environmental issues but also user's erroneous behavior.
> >> >> >>
> >> >> >
> >> >> > Yes, I agree. I concerned the multiple snapshot feature.
> >> >> > Of course, fault-tolerance is very important, and file system should
> >> >> > support
> >> >> > it as you know as power-off-recovery.
> >> >> > f2fs supports the recovery mechanism by adopting checkpoint similar
> >> >> > to
> >> >> > snapshot.
> >> >> > But, f2fs does not support multiple snapshots for user convenience.
> >> >> > I just focused on the performance, and absolutely, the multiple
> >> >> > snapshot
> >> >> > feature is also a good alternative approach.
> >> >> > That may be a trade-off.
> >> >> >
> >> >> >> As I understand, it is not possible to have a perfect performance
> >> >> >> in
> >> >> >> all
> >> >> >> possible workloads. Could you
> >> >> >> point out what workloads are the best way of F2FS using?
> >> >> >
> >> >> > Basically I think the following workloads will be good for F2FS.
> >> >> > - Many random writes : it's LFS nature
> >> >> > - Small writes with frequent fsync : f2fs is optimized to reduce the
> >> >> > fsync
> >> >> > overhead.
> >> >> >
> >> >> >>
> >> >> >> > Maybe or obviously it is possible to optimize ext4 or btrfs to
> >> >> >> > flash
> >> >> >> > storages.
> >> >> >> > IMHO, however, they are originally designed for HDDs, so that it
> >> >> >> > may
> >> >> >> > or
> >> >> >> > may not suffer from
> >> >> >> fundamental designs.
> >> >> >> > I don't know, but why not designing a new file system for flash
> >> >> >> > storages
> >> >> >> > as a counterpart?
> >> >> >> >
> >> >> >>
> >> >> >> Yes, it is possible. But F2FS is not flash oriented filesystem as
> >> >> >> JFFS2,
> >> >> >> YAFFS2, UBIFS but block-
> >> >> >> oriented filesystem. So, F2FS design is restricted by block-layer's
> >> >> >> opportunities in the using of
> >> >> >> flash storages' peculiarities. Could you point out key points of
> >> >> >> F2FS
> >> >> >> design that makes this design
> >> >> >> fundamentally unique?
> >> >> >
> >> >> > As you can see the f2fs kernel document patch, I think one of the
> >> >> > most
> >> >> > important features is to align operating units between f2fs and ftl.
> >> >> > Specifically, f2fs has section and zone, which are cleaning unit and
> >> >> > basic
> >> >> > allocation unit respectively.
> >> >> > Through these configurable units in f2fs, I think f2fs is able to
> >> >> > reduce
> >> >> > the
> >> >> > unnecessary operations done by FTL.
> >> >> > And, in order to avoid changing IO patterns by the block-layer, f2fs
> >> >> > merges
> >> >> > itself some bios likewise ext4.
> >> >> Hello.
> >> >> The internal of eMMC and SSD is the blackbox from user side.
> >> >> How does the normal user easily set operating units alignment(page
> >> >> size and physical block size ?) between f2fs and ftl in storage device
> >> >> ?
> >> >
> >> > I've known that some works have been tried to figure out the units by
> >> > profiling the storage, AKA reverse engineering.
> >> > In most cases, the simplest way is to measure the latencies of
> >> > consecutive
> >> > writes and analyze their patterns.
> >> > As you mentioned, in practical, users will not want to do this, so maybe
> >> > we
> >> > need a tool to profile them to optimize f2fs.
> >> > In the current state, I think profiling is an another issue, and
> >> > mkfs.f2fs
> >> > had better include this work in the future.
> >> Well, Format tool evaluates optimal block size whenever formatting? As
> >> you know, The size of Flash Based storage device is increasing every
> >> year. It means format time can be too long on larger devices(e.g. one
> >> device, one parition).
> >
> > Every file systems will suffer from the long format time in such a huge
> > device.
> > And, I don't think the profiling time would not be scaled up, since it's
> > unnecessary to scan whole device.
> > After getting the size, we just can stop it.
> The key point is that you should estimate correct optimal block size
> of ftl with much less I/O at format time.

Yes, exactly.

> I am not sure it is possible.

Why do you think that?
When I tested this before, I could see a clear pattern after writing just several tens of MB on an eMMC.
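
For reference, the kind of measurement I mean looks roughly like this user-space sketch (chunk size and
count are arbitrary; it needs a scratch device or file you can overwrite, and reading the resulting
latency pattern is the hard part):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* User-space sketch of the timing measurement only: write consecutive
 * aligned chunks and print the per-chunk latency.  Spikes that repeat
 * with a fixed period hint at the FTL's internal unit.  Error handling
 * is minimal on purpose.                                              */
#define CHUNK   (512 * 1024)
#define CHUNKS  128

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <scratch-device-or-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, CHUNK)) { perror("posix_memalign"); return 1; }
    memset(buf, 0x5a, CHUNK);

    for (int i = 0; i < CHUNKS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pwrite(fd, buf, CHUNK, (off_t)i * CHUNK) != CHUNK) {
            perror("pwrite");
            break;
        }
        fsync(fd);                       /* flush the device cache too */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                  (t1.tv_nsec - t0.tv_nsec) / 1000L;
        printf("%3d  offset %8lld KB  %6ld us\n",
               i, (long long)i * (CHUNK / 1024), us);
    }

    free(buf);
    close(fd);
    return 0;
}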

> And you should prove optimal block size is really correct on several
> device per vendor device.

Yes, that's correct, but unfortunately I cannot prove it for all devices.
You're arguing heuristic vs. optimal approaches.
IMHO, most file systems are based on heuristics.
f2fs also adopts a heuristic approach, which means it tries to help the FTL as much as possible
rather than cooperating with the FTL directly.
Furthermore, even if the default unit size is not optimal, I believe it still behaves well in most cases.
(Since most SSDs have a 512KB erase block size, 2MB can cover 4-way interleaved SSDs.)
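
Just to spell out that parenthetical, the arithmetic is simply (nothing f2fs-specific):

#include <stdio.h>

/* 512KB erase blocks interleaved across 4 ways add up to a 2MB unit,
 * which is why a 2MB default section can still line up with such
 * devices.  Purely illustrative arithmetic.                          */
int main(void)
{
    const unsigned int erase_block_kb = 512;
    const unsigned int ways = 4;

    printf("effective unit: %u KB (= %u MB)\n",
           erase_block_kb * ways, erase_block_kb * ways / 1024);
    return 0;
}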

Thanks,

> 
> >
> >> > But, IMO, from the viewpoint of performance, default configuration is
> >> > quite
> >> > enough now.
> >> At default(after cleanly format), Would you share performance
> >> difference between other log structured filesystems in comparison to
> >> f2fs instead of ext4 ?
> >>
> >
> > Actually, we've focused on ext4, so I have no results of other file systems
> > measured on embedded systems.
> > I'll test sooner or later, and report them.
> Okay, Thanks Jaegeuk.
> 
> > Thank you for valuable comments.
> >
> >> Thanks.
> >> >
> >> > ps) f2fs doesn't care about the flash page size, but considers garbage
> >> > collection unit.
> >> >
> >> >>
> >> >> Thanks.
> >> >>
> >> >> >
> >> >> >>
> >> >> >> With the best regards,
> >> >> >> Vyacheslav Dubeyko.
> >> >> >>
> >> >> >>
> >> >> >> >>
> >> >> >> >> Marco
> >> >> >> >
> >> >> >> > ---
> >> >> >> > Jaegeuk Kim
> >> >> >> > Samsung
> >> >> >> >
> >> >> >> > --
> >> >> >> > To unsubscribe from this list: send the line "unsubscribe
> >> >> >> > linux-kernel"
> >> >> >> > in
> >> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> >> > More majordomo info at
> >> >> >> > http://vger.kernel.org/majordomo-info.html
> >> >> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >> >> >
> >> >> >
> >> >> > ---
> >> >> > Jaegeuk Kim
> >> >> > Samsung
> >> >> >
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe
> >> >> > linux-fsdevel"
> >> >> > in
> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >
> >> >
> >> >
> >> > ---
> >> > Jaegeuk Kim
> >> > Samsung
> >> >
> >> >
> >> >
> >
> >
> > ---
> > Jaegeuk Kim
> > Samsung
> >
> >


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-08 10:52               ` Jaegeuk Kim
  2012-10-08 11:21                 ` Namjae Jeon
@ 2012-10-09  8:31                 ` Lukáš Czerner
  2012-10-09 10:45                     ` Jaegeuk Kim
  2012-10-10 10:36                   ` David Woodhouse
  1 sibling, 2 replies; 154+ messages in thread
From: Lukáš Czerner @ 2012-10-09  8:31 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Namjae Jeon', 'Vyacheslav Dubeyko',
	'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel


On Mon, 8 Oct 2012, Jaegeuk Kim wrote:

> Date: Mon, 08 Oct 2012 19:52:03 +0900
> From: Jaegeuk Kim <jaegeuk.kim@samsung.com>
> To: 'Namjae Jeon' <linkinjeon@gmail.com>
> Cc: 'Vyacheslav Dubeyko' <slava@dubeyko.com>,
>     'Marco Stornelli' <marco.stornelli@gmail.com>,
>     'Jaegeuk Kim' <jaegeuk.kim@gmail.com>,
>     'Al Viro' <viro@zeniv.linux.org.uk>, tytso@mit.edu,
>     gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
>     chur.lee@samsung.com, cm224.lee@samsung.com, jooyoung.hwang@samsung.com,
>     linux-fsdevel@vger.kernel.org
> Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> > -----Original Message-----
> > From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> > Sent: Monday, October 08, 2012 7:00 PM
> > To: Jaegeuk Kim
> > Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; tytso@mit.edu;
> > gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> > jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > 
> > 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> > >> -----Original Message-----
> > >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > >> Sent: Sunday, October 07, 2012 9:09 PM
> > >> To: Jaegeuk Kim
> > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> > >> gregkh@linuxfoundation.org; linux-
> > >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> > >> jooyoung.hwang@samsung.com;
> > >> linux-fsdevel@vger.kernel.org
> > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >>
> > >> Hi,
> > >>
> > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > >>
> > >> >> -----Original Message-----
> > >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> > >> >> Sent: Sunday, October 07, 2012 4:10 PM
> > >> >> To: Jaegeuk Kim
> > >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
> > >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
> > >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> > >> >> cm224.lee@samsung.com;
> > >> jooyoung.hwang@samsung.com;
> > >> >> linux-fsdevel@vger.kernel.org
> > >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >> >>
> > >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> > >> >>>> Hi Jaegeuk,
> > >> >>>
> > >> >>> Hi.
> > >> >>> We know each other, right? :)
> > >> >>>
> > >> >>>>
> > >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> > >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> > >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> > >> >> chur.lee@samsung.com,
> > >> cm224.lee@samsung.com,
> > >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> > >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> > >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> > >> >>>>>
> > >> >>>>> This is a new patch set for the f2fs file system.
> > >> >>>>>
> > >> >>>>> What is F2FS?
> > >> >>>>> =============
> > >> >>>>>
> > >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD
> > >> >>>>> cards, have
> > >> >>>>> been widely being used for ranging from mobile to server systems.
> > >> >>>>> Since they are
> > >> >>>>> known to have different characteristics from the conventional
> > >> >>>>> rotational disks,
> > >> >>>>> a file system, an upper layer to the storage device, should adapt to
> > >> >>>>> the changes
> > >> >>>>> from the sketch.
> > >> >>>>>
> > >> >>>>> F2FS is a new file system carefully designed for the NAND flash
> > >> >>>>> memory-based storage
> > >> >>>>> devices. We chose a log structure file system approach, but we tried
> > >> >>>>> to adapt it
> > >> >>>>> to the new form of storage. Also we remedy some known issues of the
> > >> >>>>> very old log
> > >> >>>>> structured file system, such as snowball effect of wandering tree
> > >> >>>>> and high cleaning
> > >> >>>>> overhead.
> > >> >>>>>
> > >> >>>>> Because a NAND-based storage device shows different characteristics
> > >> >>>>> according to
> > >> >>>>> its internal geometry or flash memory management scheme aka FTL, we
> > >> >>>>> add various
> > >> >>>>> parameters not only for configuring on-disk layout, but also for
> > >> >>>>> selecting allocation
> > >> >>>>> and cleaning algorithms.
> > >> >>>>>
> > >> >>>>
> > >> >>>> What about F2FS performance? Could you share benchmarking results of
> > >> >>>> the new file system?
> > >> >>>>
> > >> >>>> It is very interesting the case of aged file system. How is GC's
> > >> >>>> implementation efficient? Could
> > >> >> you share benchmarking results for the very aged file system state?
> > >> >>>>
> > >> >>>
> > >> >>> Although I have benchmark results, currently I'd like to see the
> > >> >>> results
> > >> >>> measured by community as a black-box. As you know, the results are
> > >> >>> very
> > >> >>> dependent on the workloads and parameters, so I think it would be
> > >> >>> better
> > >> >>> to see other results for a while.
> > >> >>> Thanks,
> > >> >>>
> > >> >>
> > >> >> 1) Actually it's a strange approach. If you have got any results you
> > >> >> should share them with the community explaining how (the workload, hw
> > >> >> and so on) your benchmark works and the specific condition. I really
> > >> >> don't like the approach "I've got the results but I don't say
> > >> >> anything,
> > >> >> if you want a number, do it yourself".
> > >> >
> > >> > It's definitely right, and I meant *for a while*.
> > >> > I just wanted to avoid arguing with how to age file system in this
> > >> > time.
> > >> > Before then, I share the primitive results as follows.
> > >> >
> > >> > 1. iozone in Panda board
> > >> > - ARM A9
> > >> > - DRAM : 1GB
> > >> > - Kernel: Linux 3.3
> > >> > - Partition: 12GB (64GB Samsung eMMC)
> > >> > - Tested on 2GB file
> > >> >
> > >> >           seq. read, seq. write, rand. read, rand. write
> > >> > - ext4:    30.753         17.066       5.06         4.15
> > >> > - f2fs:    30.71          16.906       5.073       15.204
> > >> >
> > >> > 2. iozone in Galaxy Nexus
> > >> > - DRAM : 1GB
> > >> > - Android 4.0.4_r1.2
> > >> > - Kernel omap 3.0.8
> > >> > - Partition: /data, 12GB
> > >> > - Tested on 2GB file
> > >> >
> > >> >           seq. read, seq. write, rand. read,  rand. write
> > >> > - ext4:    29.88        12.83         11.43          0.56
> > >> > - f2fs:    29.70        13.34         10.79         12.82
> > >> >
> > >>
> > >>
> > >> This is results for non-aged filesystem state. Am I correct?
> > >>
> > >
> > > Yes, right.
> > >
> > >>
> > >> > Due to the company secret, I expect to show other results after
> > >> > presenting f2fs at korea linux forum.
> > >> >
> > >> >> 2) For a new filesystem you should send the patches to linux-fsdevel.
> > >> >
> > >> > Yes, that was totally my mistake.
> > >> >
> > >> >> 3) It's not clear the pros/cons of your filesystem, can you share with
> > >> >> us the main differences with the current fs already in mainline? Or is
> > >> >> it a company secret?
> > >> >
> > >> > After forum, I can share the slides, and I hope they will be useful to
> > >> > you.
> > >> >
> > >> > Instead, let me summarize at a glance compared with other file systems.
> > >> > Here are several log-structured file systems.
> > >> > Note that, F2FS operates on top of block device with consideration on
> > >> > the FTL behavior.
> > >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed
> > >> > for raw NAND flash.
> > >> > LogFS is initially designed for raw NAND flash, but expanded to block
> > >> > device.
> > >> > But, I don't know whether it is stable or not.
> > >> > NILFS2 is one of major log-structured file systems, which supports
> > >> > multiple snap-shots.
> > >> > IMO, that feature is quite promising and important to users, but it may
> > >> > degrade the performance.
> > >> > There is a trade-off between functionalities and performance.
> > >> > F2FS chose high performance without any further fancy functionalities.
> > >> >
> > >>
> > >> Performance is a good goal. But fault-tolerance is also very important
> > >> point. Filesystems are used by
> > >> users, so, it is very important to guarantee reliability of data keeping.
> > >> Degradation of performance
> > >> by means of snapshots is arguable point. Snapshots can solve the problem
> > >> not only some unpredictable
> > >> environmental issues but also user's erroneous behavior.
> > >>
> > >
> > > Yes, I agree. I concerned the multiple snapshot feature.
> > > Of course, fault-tolerance is very important, and file system should support
> > > it as you know as power-off-recovery.
> > > f2fs supports the recovery mechanism by adopting checkpoint similar to
> > > snapshot.
> > > But, f2fs does not support multiple snapshots for user convenience.
> > > I just focused on the performance, and absolutely, the multiple snapshot
> > > feature is also a good alternative approach.
> > > That may be a trade-off.
> > >
> > >> As I understand, it is not possible to have a perfect performance in all
> > >> possible workloads. Could you
> > >> point out what workloads are the best way of F2FS using?
> > >
> > > Basically I think the following workloads will be good for F2FS.
> > > - Many random writes : it's LFS nature
> > > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync
> > > overhead.
> > >
> > >>
> > >> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash
> > >> > storages.
> > >> > IMHO, however, they are originally designed for HDDs, so that it may or
> > >> > may not suffer from
> > >> fundamental designs.
> > >> > I don't know, but why not designing a new file system for flash storages
> > >> > as a counterpart?
> > >> >
> > >>
> > >> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2,
> > >> YAFFS2, UBIFS but block-
> > >> oriented filesystem. So, F2FS design is restricted by block-layer's
> > >> opportunities in the using of
> > >> flash storages' peculiarities. Could you point out key points of F2FS
> > >> design that makes this design
> > >> fundamentally unique?
> > >
> > > As you can see the f2fs kernel document patch, I think one of the most
> > > important features is to align operating units between f2fs and ftl.
> > > Specifically, f2fs has section and zone, which are cleaning unit and basic
> > > allocation unit respectively.
> > > Through these configurable units in f2fs, I think f2fs is able to reduce the
> > > unnecessary operations done by FTL.
> > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges
> > > itself some bios likewise ext4.
> > Hello.
> > The internal of eMMC and SSD is the blackbox from user side.
> > How does the normal user easily set operating units alignment(page
> > size and physical block size ?) between f2fs and ftl in storage device
> > ?
> 
> I've known that some works have been tried to figure out the units by profiling the storage, AKA reverse engineering.
> In most cases, the simplest way is to measure the latencies of consecutive writes and analyze their patterns.
> As you mentioned, in practical, users will not want to do this, so maybe we need a tool to profile them to optimize f2fs.
> In the current state, I think profiling is an another issue, and mkfs.f2fs had better include this work in the future.
> But, IMO, from the viewpoint of performance, default configuration is quite enough now.
> 
> ps) f2fs doesn't care about the flash page size, but considers garbage collection unit.

I am sorry, but this reply makes me smile. How can you design a
file system relying on timing-attack heuristics to figure out what
the proper layout should be? Or even endorse such heuristics being
used in mkfs? What we should be focusing on is pushing vendors to
actually give us this information so we can properly propagate it
throughout the kernel - that's something everyone will benefit
from. After that, the optimization can be done in every file
system.

Promoting timing heuristics instead of pushing vendors to tell us
how their hardware should be used is a journey to hell, and we've
been talking about this for a long time now. And I imagine that you
especially have quite some persuasion power.

Thanks!
-Lukas

> 
> > 
> > Thanks.
> > 
> > >
> > >>
> > >> With the best regards,
> > >> Vyacheslav Dubeyko.
> > >>
> > >>
> > >> >>
> > >> >> Marco
> > >> >
> > >> > ---
> > >> > Jaegeuk Kim
> > >> > Samsung
> > >> >
> > >> > --
> > >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> > >> > in
> > >> > the body of a message to majordomo@vger.kernel.org
> > >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >> > Please read the FAQ at  http://www.tux.org/lkml/
> > >
> > >
> > > ---
> > > Jaegeuk Kim
> > > Samsung
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >
> 
> 
> ---
> Jaegeuk Kim
> Samsung
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09  8:31                 ` Lukáš Czerner
@ 2012-10-09 10:45                     ` Jaegeuk Kim
  2012-10-10 10:36                   ` David Woodhouse
  1 sibling, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-09 10:45 UTC (permalink / raw)
  To: 'Lukáš Czerner'
  Cc: 'Namjae Jeon', 'Vyacheslav Dubeyko',
	'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> -----Original Message-----
> From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of
> Lukáš Czerner
> Sent: Tuesday, October 09, 2012 5:32 PM
> To: Jaegeuk Kim
> Cc: 'Namjae Jeon'; 'Vyacheslav Dubeyko'; 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> On Mon, 8 Oct 2012, Jaegeuk Kim wrote:
> 
> > Date: Mon, 08 Oct 2012 19:52:03 +0900
> > From: Jaegeuk Kim <jaegeuk.kim@samsung.com>
> > To: 'Namjae Jeon' <linkinjeon@gmail.com>
> > Cc: 'Vyacheslav Dubeyko' <slava@dubeyko.com>,
> >     'Marco Stornelli' <marco.stornelli@gmail.com>,
> >     'Jaegeuk Kim' <jaegeuk.kim@gmail.com>,
> >     'Al Viro' <viro@zeniv.linux.org.uk>, tytso@mit.edu,
> >     gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> >     chur.lee@samsung.com, cm224.lee@samsung.com, jooyoung.hwang@samsung.com,
> >     linux-fsdevel@vger.kernel.org
> > Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >
> > > -----Original Message-----
> > > From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> > > Sent: Monday, October 08, 2012 7:00 PM
> > > To: Jaegeuk Kim
> > > Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; tytso@mit.edu;
> > > gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> cm224.lee@samsung.com;
> > > jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> > > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >
> > > 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> > > >> -----Original Message-----
> > > >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > > >> Sent: Sunday, October 07, 2012 9:09 PM
> > > >> To: Jaegeuk Kim
> > > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> > > >> gregkh@linuxfoundation.org; linux-
> > > >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> > > >> jooyoung.hwang@samsung.com;
> > > >> linux-fsdevel@vger.kernel.org
> > > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > >>
> > > >> Hi,
> > > >>
> > > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > > >>
> > > >> >> -----Original Message-----
> > > >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> > > >> >> Sent: Sunday, October 07, 2012 4:10 PM
> > > >> >> To: Jaegeuk Kim
> > > >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
> > > >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
> > > >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> > > >> >> cm224.lee@samsung.com;
> > > >> jooyoung.hwang@samsung.com;
> > > >> >> linux-fsdevel@vger.kernel.org
> > > >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > >> >>
> > > >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > > >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> > > >> >>>> Hi Jaegeuk,
> > > >> >>>
> > > >> >>> Hi.
> > > >> >>> We know each other, right? :)
> > > >> >>>
> > > >> >>>>
> > > >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> > > >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> > > >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> > > >> >> chur.lee@samsung.com,
> > > >> cm224.lee@samsung.com,
> > > >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> > > >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> > > >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> > > >> >>>>>
> > > >> >>>>> This is a new patch set for the f2fs file system.
> > > >> >>>>>
> > > >> >>>>> What is F2FS?
> > > >> >>>>> =============
> > > >> >>>>>
> > > >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD
> > > >> >>>>> cards, have
> > > >> >>>>> been widely being used for ranging from mobile to server systems.
> > > >> >>>>> Since they are
> > > >> >>>>> known to have different characteristics from the conventional
> > > >> >>>>> rotational disks,
> > > >> >>>>> a file system, an upper layer to the storage device, should adapt to
> > > >> >>>>> the changes
> > > >> >>>>> from the sketch.
> > > >> >>>>>
> > > >> >>>>> F2FS is a new file system carefully designed for the NAND flash
> > > >> >>>>> memory-based storage
> > > >> >>>>> devices. We chose a log structure file system approach, but we tried
> > > >> >>>>> to adapt it
> > > >> >>>>> to the new form of storage. Also we remedy some known issues of the
> > > >> >>>>> very old log
> > > >> >>>>> structured file system, such as snowball effect of wandering tree
> > > >> >>>>> and high cleaning
> > > >> >>>>> overhead.
> > > >> >>>>>
> > > >> >>>>> Because a NAND-based storage device shows different characteristics
> > > >> >>>>> according to
> > > >> >>>>> its internal geometry or flash memory management scheme aka FTL, we
> > > >> >>>>> add various
> > > >> >>>>> parameters not only for configuring on-disk layout, but also for
> > > >> >>>>> selecting allocation
> > > >> >>>>> and cleaning algorithms.
> > > >> >>>>>
> > > >> >>>>
> > > >> >>>> What about F2FS performance? Could you share benchmarking results of
> > > >> >>>> the new file system?
> > > >> >>>>
> > > >> >>>> It is very interesting the case of aged file system. How is GC's
> > > >> >>>> implementation efficient? Could
> > > >> >> you share benchmarking results for the very aged file system state?
> > > >> >>>>
> > > >> >>>
> > > >> >>> Although I have benchmark results, currently I'd like to see the
> > > >> >>> results
> > > >> >>> measured by community as a black-box. As you know, the results are
> > > >> >>> very
> > > >> >>> dependent on the workloads and parameters, so I think it would be
> > > >> >>> better
> > > >> >>> to see other results for a while.
> > > >> >>> Thanks,
> > > >> >>>
> > > >> >>
> > > >> >> 1) Actually it's a strange approach. If you have got any results you
> > > >> >> should share them with the community explaining how (the workload, hw
> > > >> >> and so on) your benchmark works and the specific condition. I really
> > > >> >> don't like the approach "I've got the results but I don't say
> > > >> >> anything,
> > > >> >> if you want a number, do it yourself".
> > > >> >
> > > >> > It's definitely right, and I meant *for a while*.
> > > >> > I just wanted to avoid arguing with how to age file system in this
> > > >> > time.
> > > >> > Before then, I share the primitive results as follows.
> > > >> >
> > > >> > 1. iozone in Panda board
> > > >> > - ARM A9
> > > >> > - DRAM : 1GB
> > > >> > - Kernel: Linux 3.3
> > > >> > - Partition: 12GB (64GB Samsung eMMC)
> > > >> > - Tested on 2GB file
> > > >> >
> > > >> >           seq. read, seq. write, rand. read, rand. write
> > > >> > - ext4:    30.753         17.066       5.06         4.15
> > > >> > - f2fs:    30.71          16.906       5.073       15.204
> > > >> >
> > > >> > 2. iozone in Galaxy Nexus
> > > >> > - DRAM : 1GB
> > > >> > - Android 4.0.4_r1.2
> > > >> > - Kernel omap 3.0.8
> > > >> > - Partition: /data, 12GB
> > > >> > - Tested on 2GB file
> > > >> >
> > > >> >           seq. read, seq. write, rand. read,  rand. write
> > > >> > - ext4:    29.88        12.83         11.43          0.56
> > > >> > - f2fs:    29.70        13.34         10.79         12.82
> > > >> >
> > > >>
> > > >>
> > > >> This is results for non-aged filesystem state. Am I correct?
> > > >>
> > > >
> > > > Yes, right.
> > > >
> > > >>
> > > >> > Due to the company secret, I expect to show other results after
> > > >> > presenting f2fs at korea linux forum.
> > > >> >
> > > >> >> 2) For a new filesystem you should send the patches to linux-fsdevel.
> > > >> >
> > > >> > Yes, that was totally my mistake.
> > > >> >
> > > >> >> 3) It's not clear the pros/cons of your filesystem, can you share with
> > > >> >> us the main differences with the current fs already in mainline? Or is
> > > >> >> it a company secret?
> > > >> >
> > > >> > After forum, I can share the slides, and I hope they will be useful to
> > > >> > you.
> > > >> >
> > > >> > Instead, let me summarize at a glance compared with other file systems.
> > > >> > Here are several log-structured file systems.
> > > >> > Note that, F2FS operates on top of block device with consideration on
> > > >> > the FTL behavior.
> > > >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed
> > > >> > for raw NAND flash.
> > > >> > LogFS is initially designed for raw NAND flash, but expanded to block
> > > >> > device.
> > > >> > But, I don't know whether it is stable or not.
> > > >> > NILFS2 is one of major log-structured file systems, which supports
> > > >> > multiple snap-shots.
> > > >> > IMO, that feature is quite promising and important to users, but it may
> > > >> > degrade the performance.
> > > >> > There is a trade-off between functionalities and performance.
> > > >> > F2FS chose high performance without any further fancy functionalities.
> > > >> >
> > > >>
> > > >> Performance is a good goal. But fault-tolerance is also very important
> > > >> point. Filesystems are used by
> > > >> users, so, it is very important to guarantee reliability of data keeping.
> > > >> Degradation of performance
> > > >> by means of snapshots is arguable point. Snapshots can solve the problem
> > > >> not only some unpredictable
> > > >> environmental issues but also user's erroneous behavior.
> > > >>
> > > >
> > > > Yes, I agree. I concerned the multiple snapshot feature.
> > > > Of course, fault-tolerance is very important, and file system should support
> > > > it as you know as power-off-recovery.
> > > > f2fs supports the recovery mechanism by adopting checkpoint similar to
> > > > snapshot.
> > > > But, f2fs does not support multiple snapshots for user convenience.
> > > > I just focused on the performance, and absolutely, the multiple snapshot
> > > > feature is also a good alternative approach.
> > > > That may be a trade-off.
> > > >
> > > >> As I understand, it is not possible to have a perfect performance in all
> > > >> possible workloads. Could you
> > > >> point out what workloads are the best way of F2FS using?
> > > >
> > > > Basically I think the following workloads will be good for F2FS.
> > > > - Many random writes : it's LFS nature
> > > > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync
> > > > overhead.
> > > >
> > > >>
> > > >> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash
> > > >> > storages.
> > > >> > IMHO, however, they are originally designed for HDDs, so that it may or
> > > >> > may not suffer from
> > > >> fundamental designs.
> > > >> > I don't know, but why not designing a new file system for flash storages
> > > >> > as a counterpart?
> > > >> >
> > > >>
> > > >> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2,
> > > >> YAFFS2, UBIFS but block-
> > > >> oriented filesystem. So, F2FS design is restricted by block-layer's
> > > >> opportunities in the using of
> > > >> flash storages' peculiarities. Could you point out key points of F2FS
> > > >> design that makes this design
> > > >> fundamentally unique?
> > > >
> > > > As you can see the f2fs kernel document patch, I think one of the most
> > > > important features is to align operating units between f2fs and ftl.
> > > > Specifically, f2fs has section and zone, which are cleaning unit and basic
> > > > allocation unit respectively.
> > > > Through these configurable units in f2fs, I think f2fs is able to reduce the
> > > > unnecessary operations done by FTL.
> > > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges
> > > > itself some bios likewise ext4.
> > > Hello.
> > > The internal of eMMC and SSD is the blackbox from user side.
> > > How does the normal user easily set operating units alignment(page
> > > size and physical block size ?) between f2fs and ftl in storage device
> > > ?
> >
> > I've known that some works have been tried to figure out the units by profiling the storage, AKA
> reverse engineering.
> > In most cases, the simplest way is to measure the latencies of consecutive writes and analyze their
> patterns.
> > As you mentioned, in practical, users will not want to do this, so maybe we need a tool to profile
> them to optimize f2fs.
> > In the current state, I think profiling is an another issue, and mkfs.f2fs had better include this
> work in the future.
> > But, IMO, from the viewpoint of performance, default configuration is quite enough now.
> >
> > ps) f2fs doesn't care about the flash page size, but considers garbage collection unit.
> 
> I am sorry but this reply makes me smile. How can you design a fs
> relying on time attack heuristics to figure out what the proper
> layout should be ? Or even endorse such heuristics to be used in
> mkfs ? What we should be focusing on is to push vendors to actually
> give us such information so we can properly propagate that
> throughout the kernel - that's something everyone will benefit from.
> After that the optimization can be done in every file system.
> 

Frankly speaking, I agree that this would be the right direction eventually.
But, as you know, it's very difficult to get all flash vendors to promote and standardize that,
because each vendor has a different strategy for opening up its internal information and tries
to protect its secrets, whatever they are.

IMO, we don't need to wait for them now.
Instead, I suggest f2fs as a file system that takes such information into account in its design from the start.
In addition, I suggest using heuristics right now as a best effort.
Maybe in the future, if vendors expose something, f2fs will be able to make better use of it.
In the meantime, I strongly hope to validate and stabilize f2fs together with the community.
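
For illustration only, a user-space sketch of such a consecutive-write latency probe could look like the following. The device path, probe size and sample count are assumptions made up for the example; this is not part of the patch set or of mkfs.f2fs, and it overwrites the target partition:

/*
 * Sketch only: probe consecutive write latencies on a scratch partition.
 * WARNING: this overwrites whatever is on the target device.
 * PROBE_SZ and NR_PROBES are illustrative; periodic latency spikes every
 * N probes suggest an FTL allocation/erase unit of about N * PROBE_SZ.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define PROBE_SZ	(512 * 1024)	/* one probe write: 512 KB    */
#define NR_PROBES	64		/* covers 32 MB consecutively */

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/mmcblk0p3"; /* example */
	struct timespec t0, t1;
	void *buf;
	int fd, i;

	if (posix_memalign(&buf, 4096, PROBE_SZ))
		return 1;
	memset(buf, 0x5a, PROBE_SZ);

	/* O_DIRECT | O_SYNC: each write must reach the device before returning */
	fd = open(dev, O_WRONLY | O_DIRECT | O_SYNC);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < NR_PROBES; i++) {
		long us;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (pwrite(fd, buf, PROBE_SZ, (off_t)i * PROBE_SZ) != PROBE_SZ) {
			perror("pwrite");
			break;
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		us = (t1.tv_sec - t0.tv_sec) * 1000000L +
		     (t1.tv_nsec - t0.tv_nsec) / 1000;
		printf("probe %2d (offset %6ld KB): %ld us\n",
		       i, (long)(((off_t)i * PROBE_SZ) >> 10), us);
	}

	close(fd);
	free(buf);
	return 0;
}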

> Promoting time attack heuristics instead of pushing vendors to tell
> us how their hardware should be used is a journey to hell and we've
> been talking about this for a looong time now. And I imagine that
> you especially have quite some persuasion power.

I know. :)
If there comes a chance, I want to try.
Thanks,

> 
> Thanks!
> -Lukas
> 
> >
> > >
> > > Thanks.
> > >
> > > >
> > > >>
> > > >> With the best regards,
> > > >> Vyacheslav Dubeyko.
> > > >>
> > > >>
> > > >> >>
> > > >> >> Marco
> > > >> >
> > > >> > ---
> > > >> > Jaegeuk Kim
> > > >> > Samsung
> > > >> >
> > > >> > --
> > > >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> > > >> > in
> > > >> > the body of a message to majordomo@vger.kernel.org
> > > >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >> > Please read the FAQ at  http://www.tux.org/lkml/
> > > >
> > > >
> > > > ---
> > > > Jaegeuk Kim
> > > > Samsung
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >
> >
> >
> > ---
> > Jaegeuk Kim
> > Samsung
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >



---
Jaegeuk Kim
Samsung



^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 10:45                     ` Jaegeuk Kim
  (?)
@ 2012-10-09 11:01                     ` Lukáš Czerner
  2012-10-09 12:01                       ` Jaegeuk Kim
  2012-10-10  4:53                         ` Theodore Ts'o
  -1 siblings, 2 replies; 154+ messages in thread
From: Lukáš Czerner @ 2012-10-09 11:01 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Lukáš Czerner', 'Namjae Jeon',
	'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 17683 bytes --]

On Tue, 9 Oct 2012, Jaegeuk Kim wrote:

> Date: Tue, 09 Oct 2012 19:45:57 +0900
> From: Jaegeuk Kim <jaegeuk.kim@samsung.com>
> To: 'Lukáš Czerner' <lczerner@redhat.com>
> Cc: 'Namjae Jeon' <linkinjeon@gmail.com>,
>     'Vyacheslav Dubeyko' <slava@dubeyko.com>,
>     'Marco Stornelli' <marco.stornelli@gmail.com>,
>     'Jaegeuk Kim' <jaegeuk.kim@gmail.com>,
>     'Al Viro' <viro@zeniv.linux.org.uk>, tytso@mit.edu,
>     gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
>     chur.lee@samsung.com, cm224.lee@samsung.com, jooyoung.hwang@samsung.com,
>     linux-fsdevel@vger.kernel.org
> Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> > -----Original Message-----
> > From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of
> > Lukáš Czerner
> > Sent: Tuesday, October 09, 2012 5:32 PM
> > To: Jaegeuk Kim
> > Cc: 'Namjae Jeon'; 'Vyacheslav Dubeyko'; 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> > gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> > jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> > Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > 
> > On Mon, 8 Oct 2012, Jaegeuk Kim wrote:
> > 
> > > Date: Mon, 08 Oct 2012 19:52:03 +0900
> > > From: Jaegeuk Kim <jaegeuk.kim@samsung.com>
> > > To: 'Namjae Jeon' <linkinjeon@gmail.com>
> > > Cc: 'Vyacheslav Dubeyko' <slava@dubeyko.com>,
> > >     'Marco Stornelli' <marco.stornelli@gmail.com>,
> > >     'Jaegeuk Kim' <jaegeuk.kim@gmail.com>,
> > >     'Al Viro' <viro@zeniv.linux.org.uk>, tytso@mit.edu,
> > >     gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> > >     chur.lee@samsung.com, cm224.lee@samsung.com, jooyoung.hwang@samsung.com,
> > >     linux-fsdevel@vger.kernel.org
> > > Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >
> > > > -----Original Message-----
> > > > From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> > > > Sent: Monday, October 08, 2012 7:00 PM
> > > > To: Jaegeuk Kim
> > > > Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; tytso@mit.edu;
> > > > gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> > cm224.lee@samsung.com;
> > > > jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> > > > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > >
> > > > 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> > > > >> -----Original Message-----
> > > > >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > > > >> Sent: Sunday, October 07, 2012 9:09 PM
> > > > >> To: Jaegeuk Kim
> > > > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> > > > >> gregkh@linuxfoundation.org; linux-
> > > > >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> > > > >> jooyoung.hwang@samsung.com;
> > > > >> linux-fsdevel@vger.kernel.org
> > > > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > > >>
> > > > >> Hi,
> > > > >>
> > > > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > > > >>
> > > > >> >> -----Original Message-----
> > > > >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> > > > >> >> Sent: Sunday, October 07, 2012 4:10 PM
> > > > >> >> To: Jaegeuk Kim
> > > > >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
> > > > >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
> > > > >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> > > > >> >> cm224.lee@samsung.com;
> > > > >> jooyoung.hwang@samsung.com;
> > > > >> >> linux-fsdevel@vger.kernel.org
> > > > >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > > >> >>
> > > > >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > > > >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> > > > >> >>>> Hi Jaegeuk,
> > > > >> >>>
> > > > >> >>> Hi.
> > > > >> >>> We know each other, right? :)
> > > > >> >>>
> > > > >> >>>>
> > > > >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> > > > >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> > > > >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> > > > >> >> chur.lee@samsung.com,
> > > > >> cm224.lee@samsung.com,
> > > > >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> > > > >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> > > > >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> > > > >> >>>>>
> > > > >> >>>>> This is a new patch set for the f2fs file system.
> > > > >> >>>>>
> > > > >> >>>>> What is F2FS?
> > > > >> >>>>> =============
> > > > >> >>>>>
> > > > >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD
> > > > >> >>>>> cards, have
> > > > >> >>>>> been widely being used for ranging from mobile to server systems.
> > > > >> >>>>> Since they are
> > > > >> >>>>> known to have different characteristics from the conventional
> > > > >> >>>>> rotational disks,
> > > > >> >>>>> a file system, an upper layer to the storage device, should adapt to
> > > > >> >>>>> the changes
> > > > >> >>>>> from the sketch.
> > > > >> >>>>>
> > > > >> >>>>> F2FS is a new file system carefully designed for the NAND flash
> > > > >> >>>>> memory-based storage
> > > > >> >>>>> devices. We chose a log structure file system approach, but we tried
> > > > >> >>>>> to adapt it
> > > > >> >>>>> to the new form of storage. Also we remedy some known issues of the
> > > > >> >>>>> very old log
> > > > >> >>>>> structured file system, such as snowball effect of wandering tree
> > > > >> >>>>> and high cleaning
> > > > >> >>>>> overhead.
> > > > >> >>>>>
> > > > >> >>>>> Because a NAND-based storage device shows different characteristics
> > > > >> >>>>> according to
> > > > >> >>>>> its internal geometry or flash memory management scheme aka FTL, we
> > > > >> >>>>> add various
> > > > >> >>>>> parameters not only for configuring on-disk layout, but also for
> > > > >> >>>>> selecting allocation
> > > > >> >>>>> and cleaning algorithms.
> > > > >> >>>>>
> > > > >> >>>>
> > > > >> >>>> What about F2FS performance? Could you share benchmarking results of
> > > > >> >>>> the new file system?
> > > > >> >>>>
> > > > >> >>>> It is very interesting the case of aged file system. How is GC's
> > > > >> >>>> implementation efficient? Could
> > > > >> >> you share benchmarking results for the very aged file system state?
> > > > >> >>>>
> > > > >> >>>
> > > > >> >>> Although I have benchmark results, currently I'd like to see the
> > > > >> >>> results
> > > > >> >>> measured by community as a black-box. As you know, the results are
> > > > >> >>> very
> > > > >> >>> dependent on the workloads and parameters, so I think it would be
> > > > >> >>> better
> > > > >> >>> to see other results for a while.
> > > > >> >>> Thanks,
> > > > >> >>>
> > > > >> >>
> > > > >> >> 1) Actually it's a strange approach. If you have got any results you
> > > > >> >> should share them with the community explaining how (the workload, hw
> > > > >> >> and so on) your benchmark works and the specific condition. I really
> > > > >> >> don't like the approach "I've got the results but I don't say
> > > > >> >> anything,
> > > > >> >> if you want a number, do it yourself".
> > > > >> >
> > > > >> > It's definitely right, and I meant *for a while*.
> > > > >> > I just wanted to avoid arguing with how to age file system in this
> > > > >> > time.
> > > > >> > Before then, I share the primitive results as follows.
> > > > >> >
> > > > >> > 1. iozone in Panda board
> > > > >> > - ARM A9
> > > > >> > - DRAM : 1GB
> > > > >> > - Kernel: Linux 3.3
> > > > >> > - Partition: 12GB (64GB Samsung eMMC)
> > > > >> > - Tested on 2GB file
> > > > >> >
> > > > >> >           seq. read, seq. write, rand. read, rand. write
> > > > >> > - ext4:    30.753         17.066       5.06         4.15
> > > > >> > - f2fs:    30.71          16.906       5.073       15.204
> > > > >> >
> > > > >> > 2. iozone in Galaxy Nexus
> > > > >> > - DRAM : 1GB
> > > > >> > - Android 4.0.4_r1.2
> > > > >> > - Kernel omap 3.0.8
> > > > >> > - Partition: /data, 12GB
> > > > >> > - Tested on 2GB file
> > > > >> >
> > > > >> >           seq. read, seq. write, rand. read,  rand. write
> > > > >> > - ext4:    29.88        12.83         11.43          0.56
> > > > >> > - f2fs:    29.70        13.34         10.79         12.82
> > > > >> >
> > > > >>
> > > > >>
> > > > >> This is results for non-aged filesystem state. Am I correct?
> > > > >>
> > > > >
> > > > > Yes, right.
> > > > >
> > > > >>
> > > > >> > Due to the company secret, I expect to show other results after
> > > > >> > presenting f2fs at korea linux forum.
> > > > >> >
> > > > >> >> 2) For a new filesystem you should send the patches to linux-fsdevel.
> > > > >> >
> > > > >> > Yes, that was totally my mistake.
> > > > >> >
> > > > >> >> 3) It's not clear the pros/cons of your filesystem, can you share with
> > > > >> >> us the main differences with the current fs already in mainline? Or is
> > > > >> >> it a company secret?
> > > > >> >
> > > > >> > After forum, I can share the slides, and I hope they will be useful to
> > > > >> > you.
> > > > >> >
> > > > >> > Instead, let me summarize at a glance compared with other file systems.
> > > > >> > Here are several log-structured file systems.
> > > > >> > Note that, F2FS operates on top of block device with consideration on
> > > > >> > the FTL behavior.
> > > > >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed
> > > > >> > for raw NAND flash.
> > > > >> > LogFS is initially designed for raw NAND flash, but expanded to block
> > > > >> > device.
> > > > >> > But, I don't know whether it is stable or not.
> > > > >> > NILFS2 is one of major log-structured file systems, which supports
> > > > >> > multiple snap-shots.
> > > > >> > IMO, that feature is quite promising and important to users, but it may
> > > > >> > degrade the performance.
> > > > >> > There is a trade-off between functionalities and performance.
> > > > >> > F2FS chose high performance without any further fancy functionalities.
> > > > >> >
> > > > >>
> > > > >> Performance is a good goal. But fault-tolerance is also very important
> > > > >> point. Filesystems are used by
> > > > >> users, so, it is very important to guarantee reliability of data keeping.
> > > > >> Degradation of performance
> > > > >> by means of snapshots is arguable point. Snapshots can solve the problem
> > > > >> not only some unpredictable
> > > > >> environmental issues but also user's erroneous behavior.
> > > > >>
> > > > >
> > > > > Yes, I agree. I concerned the multiple snapshot feature.
> > > > > Of course, fault-tolerance is very important, and file system should support
> > > > > it as you know as power-off-recovery.
> > > > > f2fs supports the recovery mechanism by adopting checkpoint similar to
> > > > > snapshot.
> > > > > But, f2fs does not support multiple snapshots for user convenience.
> > > > > I just focused on the performance, and absolutely, the multiple snapshot
> > > > > feature is also a good alternative approach.
> > > > > That may be a trade-off.
> > > > >
> > > > >> As I understand, it is not possible to have a perfect performance in all
> > > > >> possible workloads. Could you
> > > > >> point out what workloads are the best way of F2FS using?
> > > > >
> > > > > Basically I think the following workloads will be good for F2FS.
> > > > > - Many random writes : it's LFS nature
> > > > > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync
> > > > > overhead.
> > > > >
> > > > >>
> > > > >> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash
> > > > >> > storages.
> > > > >> > IMHO, however, they are originally designed for HDDs, so that it may or
> > > > >> > may not suffer from
> > > > >> fundamental designs.
> > > > >> > I don't know, but why not designing a new file system for flash storages
> > > > >> > as a counterpart?
> > > > >> >
> > > > >>
> > > > >> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2,
> > > > >> YAFFS2, UBIFS but block-
> > > > >> oriented filesystem. So, F2FS design is restricted by block-layer's
> > > > >> opportunities in the using of
> > > > >> flash storages' peculiarities. Could you point out key points of F2FS
> > > > >> design that makes this design
> > > > >> fundamentally unique?
> > > > >
> > > > > As you can see the f2fs kernel document patch, I think one of the most
> > > > > important features is to align operating units between f2fs and ftl.
> > > > > Specifically, f2fs has section and zone, which are cleaning unit and basic
> > > > > allocation unit respectively.
> > > > > Through these configurable units in f2fs, I think f2fs is able to reduce the
> > > > > unnecessary operations done by FTL.
> > > > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges
> > > > > itself some bios likewise ext4.
> > > > Hello.
> > > > The internal of eMMC and SSD is the blackbox from user side.
> > > > How does the normal user easily set operating units alignment(page
> > > > size and physical block size ?) between f2fs and ftl in storage device
> > > > ?
> > >
> > > I've known that some works have been tried to figure out the units by profiling the storage, AKA
> > reverse engineering.
> > > In most cases, the simplest way is to measure the latencies of consecutive writes and analyze their
> > patterns.
> > > As you mentioned, in practical, users will not want to do this, so maybe we need a tool to profile
> > them to optimize f2fs.
> > > In the current state, I think profiling is an another issue, and mkfs.f2fs had better include this
> > work in the future.
> > > But, IMO, from the viewpoint of performance, default configuration is quite enough now.
> > >
> > > ps) f2fs doesn't care about the flash page size, but considers garbage collection unit.
> > 
> > I am sorry but this reply makes me smile. How can you design a fs
> > relying on time attack heuristics to figure out what the proper
> > layout should be ? Or even endorse such heuristics to be used in
> > mkfs ? What we should be focusing on is to push vendors to actually
> > give us such information so we can properly propagate that
> > throughout the kernel - that's something everyone will benefit from.
> > After that the optimization can be done in every file system.
> > 
> 
> Frankly speaking, I agree that this would be the right direction eventually.
> But, as you know, it's very difficult to get all flash vendors to promote and standardize that,
> because each vendor has a different strategy for opening up its internal information and tries
> to protect its secrets, whatever they are.
> 
> IMO, we don't need to wait for them now.
> Instead, I suggest f2fs as a file system that takes such information into account in its design from the start.
> In addition, I suggest using heuristics right now as a best effort.
> Maybe in the future, if vendors expose something, f2fs will be able to make better use of it.
> In the meantime, I strongly hope to validate and stabilize f2fs together with the community.

Do not get me wrong, I do not think it is worth waiting for vendors
to come to their senses, but it is worth constantly reminding them that
we *need* this kind of information and that those heuristics are not
feasible in the long run anyway.

I believe this conversation has happened several times already, but
what about having an independent public database of all the internal
information about hw from different vendors, where users can add
information gathered by the time-attack heuristic so that others do not
have to run it again and again? I am not sure whether Linaro or someone
else already has something like that; maybe someone can post a link.

Eventually we can show this to the vendors to demonstrate that their
"secrets" are already public anyway and that everyone's lives would be
easier if they just agreed to provide the information from the beginning.

> 
> > Promoting time attack heuristics instead of pushing vendors to tell
> > us how their hardware should be used is a journey to hell and we've
> > been talking about this for a looong time now. And I imagine that
> > you especially have quite some persuasion power.
> 
> I know. :)
> If there comes a chance, I want to try.
> Thanks,

That's very good to hear, thank you.

-Lukas

> 
> > 
> > Thanks!
> > -Lukas
> > 
> > >
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > >>
> > > > >> With the best regards,
> > > > >> Vyacheslav Dubeyko.
> > > > >>
> > > > >>
> > > > >> >>
> > > > >> >> Marco
> > > > >> >
> > > > >> > ---
> > > > >> > Jaegeuk Kim
> > > > >> > Samsung
> > > > >> >
> > > > >> > --
> > > > >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> > > > >> > in
> > > > >> > the body of a message to majordomo@vger.kernel.org
> > > > >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > >> > Please read the FAQ at  http://www.tux.org/lkml/
> > > > >
> > > > >
> > > > > ---
> > > > > Jaegeuk Kim
> > > > > Samsung
> > > > >
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > >
> > >
> > >
> > > ---
> > > Jaegeuk Kim
> > > Samsung
> > >
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >
> 
> 
> 
> ---
> Jaegeuk Kim
> Samsung
> 
> 
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 11:01                     ` Lukáš Czerner
@ 2012-10-09 12:01                       ` Jaegeuk Kim
  2012-10-09 12:39                         ` Lukáš Czerner
  2012-10-09 21:20                           ` Dave Chinner
  2012-10-10  4:53                         ` Theodore Ts'o
  1 sibling, 2 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-09 12:01 UTC (permalink / raw)
  To: 'Lukáš Czerner'
  Cc: 'Namjae Jeon', 'Vyacheslav Dubeyko',
	'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> -----Original Message-----
> From: Lukáš Czerner [mailto:lczerner@redhat.com]
> Sent: Tuesday, October 09, 2012 8:01 PM
> To: Jaegeuk Kim
> Cc: 'Lukáš Czerner'; 'Namjae Jeon'; 'Vyacheslav Dubeyko'; 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro';
> tytso@mit.edu; gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> cm224.lee@samsung.com; jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> On Tue, 9 Oct 2012, Jaegeuk Kim wrote:
> 
> > Date: Tue, 09 Oct 2012 19:45:57 +0900
> > From: Jaegeuk Kim <jaegeuk.kim@samsung.com>
> > To: 'Lukáš Czerner' <lczerner@redhat.com>
> > Cc: 'Namjae Jeon' <linkinjeon@gmail.com>,
> >     'Vyacheslav Dubeyko' <slava@dubeyko.com>,
> >     'Marco Stornelli' <marco.stornelli@gmail.com>,
> >     'Jaegeuk Kim' <jaegeuk.kim@gmail.com>,
> >     'Al Viro' <viro@zeniv.linux.org.uk>, tytso@mit.edu,
> >     gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> >     chur.lee@samsung.com, cm224.lee@samsung.com, jooyoung.hwang@samsung.com,
> >     linux-fsdevel@vger.kernel.org
> > Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> >
> > > -----Original Message-----
> > > From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf
> Of
> > > Lukáš Czerner
> > > Sent: Tuesday, October 09, 2012 5:32 PM
> > > To: Jaegeuk Kim
> > > Cc: 'Namjae Jeon'; 'Vyacheslav Dubeyko'; 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro';
> tytso@mit.edu;
> > > gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> cm224.lee@samsung.com;
> > > jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> > > Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >
> > > On Mon, 8 Oct 2012, Jaegeuk Kim wrote:
> > >
> > > > Date: Mon, 08 Oct 2012 19:52:03 +0900
> > > > From: Jaegeuk Kim <jaegeuk.kim@samsung.com>
> > > > To: 'Namjae Jeon' <linkinjeon@gmail.com>
> > > > Cc: 'Vyacheslav Dubeyko' <slava@dubeyko.com>,
> > > >     'Marco Stornelli' <marco.stornelli@gmail.com>,
> > > >     'Jaegeuk Kim' <jaegeuk.kim@gmail.com>,
> > > >     'Al Viro' <viro@zeniv.linux.org.uk>, tytso@mit.edu,
> > > >     gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> > > >     chur.lee@samsung.com, cm224.lee@samsung.com, jooyoung.hwang@samsung.com,
> > > >     linux-fsdevel@vger.kernel.org
> > > > Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > >
> > > > > -----Original Message-----
> > > > > From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> > > > > Sent: Monday, October 08, 2012 7:00 PM
> > > > > To: Jaegeuk Kim
> > > > > Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; tytso@mit.edu;
> > > > > gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> > > cm224.lee@samsung.com;
> > > > > jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> > > > > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > > >
> > > > > 2012/10/8, Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> > > > > >> -----Original Message-----
> > > > > >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > > > > >> Sent: Sunday, October 07, 2012 9:09 PM
> > > > > >> To: Jaegeuk Kim
> > > > > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu;
> > > > > >> gregkh@linuxfoundation.org; linux-
> > > > > >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> > > > > >> jooyoung.hwang@samsung.com;
> > > > > >> linux-fsdevel@vger.kernel.org
> > > > > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > > > >>
> > > > > >> Hi,
> > > > > >>
> > > > > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > > > > >>
> > > > > >> >> -----Original Message-----
> > > > > >> >> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> > > > > >> >> Sent: Sunday, October 07, 2012 4:10 PM
> > > > > >> >> To: Jaegeuk Kim
> > > > > >> >> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro;
> > > > > >> >> tytso@mit.edu; gregkh@linuxfoundation.org;
> > > > > >> >> linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> > > > > >> >> cm224.lee@samsung.com;
> > > > > >> jooyoung.hwang@samsung.com;
> > > > > >> >> linux-fsdevel@vger.kernel.org
> > > > > >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > > > >> >>
> > > > > >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > > > > >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> > > > > >> >>>> Hi Jaegeuk,
> > > > > >> >>>
> > > > > >> >>> Hi.
> > > > > >> >>> We know each other, right? :)
> > > > > >> >>>
> > > > > >> >>>>
> > > > > >> >>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> > > > > >> >>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> > > > > >> >> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
> > > > > >> >> chur.lee@samsung.com,
> > > > > >> cm224.lee@samsung.com,
> > > > > >> >> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> > > > > >> >>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> > > > > >> >>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> > > > > >> >>>>>
> > > > > >> >>>>> This is a new patch set for the f2fs file system.
> > > > > >> >>>>>
> > > > > >> >>>>> What is F2FS?
> > > > > >> >>>>> =============
> > > > > >> >>>>>
> > > > > >> >>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD
> > > > > >> >>>>> cards, have
> > > > > >> >>>>> been widely being used for ranging from mobile to server systems.
> > > > > >> >>>>> Since they are
> > > > > >> >>>>> known to have different characteristics from the conventional
> > > > > >> >>>>> rotational disks,
> > > > > >> >>>>> a file system, an upper layer to the storage device, should adapt to
> > > > > >> >>>>> the changes
> > > > > >> >>>>> from the sketch.
> > > > > >> >>>>>
> > > > > >> >>>>> F2FS is a new file system carefully designed for the NAND flash
> > > > > >> >>>>> memory-based storage
> > > > > >> >>>>> devices. We chose a log structure file system approach, but we tried
> > > > > >> >>>>> to adapt it
> > > > > >> >>>>> to the new form of storage. Also we remedy some known issues of the
> > > > > >> >>>>> very old log
> > > > > >> >>>>> structured file system, such as snowball effect of wandering tree
> > > > > >> >>>>> and high cleaning
> > > > > >> >>>>> overhead.
> > > > > >> >>>>>
> > > > > >> >>>>> Because a NAND-based storage device shows different characteristics
> > > > > >> >>>>> according to
> > > > > >> >>>>> its internal geometry or flash memory management scheme aka FTL, we
> > > > > >> >>>>> add various
> > > > > >> >>>>> parameters not only for configuring on-disk layout, but also for
> > > > > >> >>>>> selecting allocation
> > > > > >> >>>>> and cleaning algorithms.
> > > > > >> >>>>>
> > > > > >> >>>>
> > > > > >> >>>> What about F2FS performance? Could you share benchmarking results of
> > > > > >> >>>> the new file system?
> > > > > >> >>>>
> > > > > >> >>>> It is very interesting the case of aged file system. How is GC's
> > > > > >> >>>> implementation efficient? Could
> > > > > >> >> you share benchmarking results for the very aged file system state?
> > > > > >> >>>>
> > > > > >> >>>
> > > > > >> >>> Although I have benchmark results, currently I'd like to see the
> > > > > >> >>> results
> > > > > >> >>> measured by community as a black-box. As you know, the results are
> > > > > >> >>> very
> > > > > >> >>> dependent on the workloads and parameters, so I think it would be
> > > > > >> >>> better
> > > > > >> >>> to see other results for a while.
> > > > > >> >>> Thanks,
> > > > > >> >>>
> > > > > >> >>
> > > > > >> >> 1) Actually it's a strange approach. If you have got any results you
> > > > > >> >> should share them with the community explaining how (the workload, hw
> > > > > >> >> and so on) your benchmark works and the specific condition. I really
> > > > > >> >> don't like the approach "I've got the results but I don't say
> > > > > >> >> anything,
> > > > > >> >> if you want a number, do it yourself".
> > > > > >> >
> > > > > >> > It's definitely right, and I meant *for a while*.
> > > > > >> > I just wanted to avoid arguing with how to age file system in this
> > > > > >> > time.
> > > > > >> > Before then, I share the primitive results as follows.
> > > > > >> >
> > > > > >> > 1. iozone in Panda board
> > > > > >> > - ARM A9
> > > > > >> > - DRAM : 1GB
> > > > > >> > - Kernel: Linux 3.3
> > > > > >> > - Partition: 12GB (64GB Samsung eMMC)
> > > > > >> > - Tested on 2GB file
> > > > > >> >
> > > > > >> >           seq. read, seq. write, rand. read, rand. write
> > > > > >> > - ext4:    30.753         17.066       5.06         4.15
> > > > > >> > - f2fs:    30.71          16.906       5.073       15.204
> > > > > >> >
> > > > > >> > 2. iozone in Galaxy Nexus
> > > > > >> > - DRAM : 1GB
> > > > > >> > - Android 4.0.4_r1.2
> > > > > >> > - Kernel omap 3.0.8
> > > > > >> > - Partition: /data, 12GB
> > > > > >> > - Tested on 2GB file
> > > > > >> >
> > > > > >> >           seq. read, seq. write, rand. read,  rand. write
> > > > > >> > - ext4:    29.88        12.83         11.43          0.56
> > > > > >> > - f2fs:    29.70        13.34         10.79         12.82
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >> This is results for non-aged filesystem state. Am I correct?
> > > > > >>
> > > > > >
> > > > > > Yes, right.
> > > > > >
> > > > > >>
> > > > > >> > Due to the company secret, I expect to show other results after
> > > > > >> > presenting f2fs at korea linux forum.
> > > > > >> >
> > > > > >> >> 2) For a new filesystem you should send the patches to linux-fsdevel.
> > > > > >> >
> > > > > >> > Yes, that was totally my mistake.
> > > > > >> >
> > > > > >> >> 3) It's not clear the pros/cons of your filesystem, can you share with
> > > > > >> >> us the main differences with the current fs already in mainline? Or is
> > > > > >> >> it a company secret?
> > > > > >> >
> > > > > >> > After forum, I can share the slides, and I hope they will be useful to
> > > > > >> > you.
> > > > > >> >
> > > > > >> > Instead, let me summarize at a glance compared with other file systems.
> > > > > >> > Here are several log-structured file systems.
> > > > > >> > Note that, F2FS operates on top of block device with consideration on
> > > > > >> > the FTL behavior.
> > > > > >> > So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed
> > > > > >> > for raw NAND flash.
> > > > > >> > LogFS is initially designed for raw NAND flash, but expanded to block
> > > > > >> > device.
> > > > > >> > But, I don't know whether it is stable or not.
> > > > > >> > NILFS2 is one of major log-structured file systems, which supports
> > > > > >> > multiple snap-shots.
> > > > > >> > IMO, that feature is quite promising and important to users, but it may
> > > > > >> > degrade the performance.
> > > > > >> > There is a trade-off between functionalities and performance.
> > > > > >> > F2FS chose high performance without any further fancy functionalities.
> > > > > >> >
> > > > > >>
> > > > > >> Performance is a good goal. But fault-tolerance is also very important
> > > > > >> point. Filesystems are used by
> > > > > >> users, so, it is very important to guarantee reliability of data keeping.
> > > > > >> Degradation of performance
> > > > > >> by means of snapshots is arguable point. Snapshots can solve the problem
> > > > > >> not only some unpredictable
> > > > > >> environmental issues but also user's erroneous behavior.
> > > > > >>
> > > > > >
> > > > > > Yes, I agree. I concerned the multiple snapshot feature.
> > > > > > Of course, fault-tolerance is very important, and file system should support
> > > > > > it as you know as power-off-recovery.
> > > > > > f2fs supports the recovery mechanism by adopting checkpoint similar to
> > > > > > snapshot.
> > > > > > But, f2fs does not support multiple snapshots for user convenience.
> > > > > > I just focused on the performance, and absolutely, the multiple snapshot
> > > > > > feature is also a good alternative approach.
> > > > > > That may be a trade-off.
> > > > > >
> > > > > >> As I understand, it is not possible to have a perfect performance in all
> > > > > >> possible workloads. Could you
> > > > > >> point out what workloads are the best way of F2FS using?
> > > > > >
> > > > > > Basically I think the following workloads will be good for F2FS.
> > > > > > - Many random writes : it's LFS nature
> > > > > > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync
> > > > > > overhead.
> > > > > >
> > > > > >>
> > > > > >> > Maybe or obviously it is possible to optimize ext4 or btrfs to flash
> > > > > >> > storages.
> > > > > >> > IMHO, however, they are originally designed for HDDs, so that it may or
> > > > > >> > may not suffer from
> > > > > >> fundamental designs.
> > > > > >> > I don't know, but why not designing a new file system for flash storages
> > > > > >> > as a counterpart?
> > > > > >> >
> > > > > >>
> > > > > >> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2,
> > > > > >> YAFFS2, UBIFS but block-
> > > > > >> oriented filesystem. So, F2FS design is restricted by block-layer's
> > > > > >> opportunities in the using of
> > > > > >> flash storages' peculiarities. Could you point out key points of F2FS
> > > > > >> design that makes this design
> > > > > >> fundamentally unique?
> > > > > >
> > > > > > As you can see the f2fs kernel document patch, I think one of the most
> > > > > > important features is to align operating units between f2fs and ftl.
> > > > > > Specifically, f2fs has section and zone, which are cleaning unit and basic
> > > > > > allocation unit respectively.
> > > > > > Through these configurable units in f2fs, I think f2fs is able to reduce the
> > > > > > unnecessary operations done by FTL.
> > > > > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges
> > > > > > itself some bios likewise ext4.
> > > > > Hello.
> > > > > The internal of eMMC and SSD is the blackbox from user side.
> > > > > How does the normal user easily set operating units alignment(page
> > > > > size and physical block size ?) between f2fs and ftl in storage device
> > > > > ?
> > > >
> > > > I've known that some works have been tried to figure out the units by profiling the storage, AKA
> > > reverse engineering.
> > > > In most cases, the simplest way is to measure the latencies of consecutive writes and analyze
> their
> > > patterns.
> > > > As you mentioned, in practical, users will not want to do this, so maybe we need a tool to
> profile
> > > them to optimize f2fs.
> > > > In the current state, I think profiling is an another issue, and mkfs.f2fs had better include
> this
> > > work in the future.
> > > > But, IMO, from the viewpoint of performance, default configuration is quite enough now.
> > > >
> > > > ps) f2fs doesn't care about the flash page size, but considers garbage collection unit.
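For illustration only, here is a minimal sketch of such a latency probe (a hypothetical user-space tool, not part of the patch set; it assumes direct, synchronous 4 KiB writes to a scratch device whose contents may be destroyed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Time consecutive 4 KiB direct writes and print per-write latency (ns).
 * Regularly spaced latency spikes hint at the FTL's internal unit size. */
int main(int argc, char **argv)
{
	const size_t bs = 4096;
	void *buf;
	int fd, i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <scratch-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY | O_DIRECT | O_SYNC);
	if (fd < 0 || posix_memalign(&buf, bs, bs))
		return 1;
	memset(buf, 0, bs);

	for (i = 0; i < 8192; i++) {	/* probe the first 32 MiB */
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (write(fd, buf, bs) != (ssize_t)bs)
			break;
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("%d %ld\n", i,
		       (long)(t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));
	}
	close(fd);
	return 0;
}

If the heuristic works at all on a given device, plotting the printed latencies shows periodic spikes, and their spacing is a hint at the mapping or erase unit the FTL uses internally.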
> > >
> > > I am sorry but this reply makes me smile. How can you design a fs
> > > relying on time attack heuristics to figure out what the proper
> > > layout should be ? Or even endorse such heuristics to be used in
> > > mkfs ? What we should be focusing on is to push vendors to actually
> > > give us such information so we can properly propagate that
> > > throughout the kernel - that's something everyone will benefit from.
> > > After that the optimization can be done in every file system.
> > >
> >
> > Frankly speaking, I agree that it would be the right direction eventually.
> > But, as you know, it's very difficult for all flash vendors to promote and standardize that.
> > Because each vendors have different strategies to open their internal information and also try
> > to protect their secrets whatever they are.
> >
> > IMO, we don't need to wait them now.
> > Instead, from the start, I suggest f2fs that uses those information to the file system design.
> > In addition, I suggest using heuristics right now as best efforts.
> > Maybe in future, if vendors give something, f2fs would be more feasible.
> > In the mean time, I strongly hope to validate and stabilize f2fs with community.
> 
> Do not get me wrong, I do not think it is worth to wait for vendors
> to come to their senses, but it is worth constantly reminding that
> we *need* this kind of information and those heuristics are not
> feasible in the long run anyway.
> 
> I believe that this conversation happened several times already, but
> what about having independent public database of all the internal
> information about hw from different vendors where users can add
> information gathered by the time attack heuristic so other does not
> have to run this again and again. I am not sure if Linaro or someone
> else have something like that, someone can maybe post a link to that.
> 

As I mentioned, I agree we should keep pushing vendors to open that information.
And I absolutely didn't mean that it is worth waiting for vendors.
I meant that, until vendors open that information, efforts such as
proposing f2fs or gathering heuristics are also needed in parallel.

Anyway, building a database that gathers product information sounds very interesting.
May I access that database?

Thanks,

> Eventually we can show this to the vendors to see that their
> "secrets" are already public anyway and that everyones lives would be
> easier if they just agree to provide it from the beginning.
> 
> >
> > > Promoting time attack heuristics instead of pushing vendors to tell
> > > us how their hardware should be used is a journey to hell and we've
> > > been talking about this for a looong time now. And I imagine that
> > > you especially have quite some persuasion power.
> >
> > I know. :)
> > If there comes a chance, I want to try.
> > Thanks,
> 
> That's very good to hear, thank you.
> 
> -Lukas
> 
> >
> > >
> > > Thanks!
> > > -Lukas
> > >
> > > >
> > > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > >>
> > > > > >> With the best regards,
> > > > > >> Vyacheslav Dubeyko.
> > > > > >>
> > > > > >>
> > > > > >> >>
> > > > > >> >> Marco
> > > > > >> >
> > > > > >> > ---
> > > > > >> > Jaegeuk Kim
> > > > > >> > Samsung
> > > > > >> >
> > > > > >> > --
> > > > > >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> > > > > >> > in
> > > > > >> > the body of a message to majordomo@vger.kernel.org
> > > > > >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > >> > Please read the FAQ at  http://www.tux.org/lkml/
> > > > > >
> > > > > >
> > > > > > ---
> > > > > > Jaegeuk Kim
> > > > > > Samsung
> > > > > >
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > >
> > > >
> > > >
> > > > ---
> > > > Jaegeuk Kim
> > > > Samsung
> > > >
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >
> >
> >
> >
> > ---
> > Jaegeuk Kim
> > Samsung
> >
> >
> >


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 12:01                       ` Jaegeuk Kim
@ 2012-10-09 12:39                         ` Lukáš Czerner
  2012-10-09 13:10                           ` Jaegeuk Kim
  2012-10-09 21:20                           ` Dave Chinner
  1 sibling, 1 reply; 154+ messages in thread
From: Lukáš Czerner @ 2012-10-09 12:39 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Lukáš Czerner', 'Namjae Jeon',
	'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

On Tue, 9 Oct 2012, Jaegeuk Kim wrote:

> > > > > > >
> > > > > > > As you can see the f2fs kernel document patch, I think one of the most
> > > > > > > important features is to align operating units between f2fs and ftl.
> > > > > > > Specifically, f2fs has section and zone, which are cleaning unit and basic
> > > > > > > allocation unit respectively.
> > > > > > > Through these configurable units in f2fs, I think f2fs is able to reduce the
> > > > > > > unnecessary operations done by FTL.
> > > > > > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges
> > > > > > > itself some bios likewise ext4.
> > > > > > Hello.
> > > > > > The internal of eMMC and SSD is the blackbox from user side.
> > > > > > How does the normal user easily set operating units alignment(page
> > > > > > size and physical block size ?) between f2fs and ftl in storage device
> > > > > > ?
> > > > >
> > > > > I've known that some works have been tried to figure out the units by profiling the storage, AKA
> > > > reverse engineering.
> > > > > In most cases, the simplest way is to measure the latencies of consecutive writes and analyze
> > their
> > > > patterns.
> > > > > As you mentioned, in practical, users will not want to do this, so maybe we need a tool to
> > profile
> > > > them to optimize f2fs.
> > > > > In the current state, I think profiling is an another issue, and mkfs.f2fs had better include
> > this
> > > > work in the future.
> > > > > But, IMO, from the viewpoint of performance, default configuration is quite enough now.
> > > > >
> > > > > ps) f2fs doesn't care about the flash page size, but considers garbage collection unit.
> > > >
> > > > I am sorry but this reply makes me smile. How can you design a fs
> > > > relying on time attack heuristics to figure out what the proper
> > > > layout should be ? Or even endorse such heuristics to be used in
> > > > mkfs ? What we should be focusing on is to push vendors to actually
> > > > give us such information so we can properly propagate that
> > > > throughout the kernel - that's something everyone will benefit from.
> > > > After that the optimization can be done in every file system.
> > > >
> > >
> > > Frankly speaking, I agree that it would be the right direction eventually.
> > > But, as you know, it's very difficult for all flash vendors to promote and standardize that.
> > > Because each vendors have different strategies to open their internal information and also try
> > > to protect their secrets whatever they are.
> > >
> > > IMO, we don't need to wait them now.
> > > Instead, from the start, I suggest f2fs that uses those information to the file system design.
> > > In addition, I suggest using heuristics right now as best efforts.
> > > Maybe in future, if vendors give something, f2fs would be more feasible.
> > > In the mean time, I strongly hope to validate and stabilize f2fs with community.
> > 
> > Do not get me wrong, I do not think it is worth to wait for vendors
> > to come to their senses, but it is worth constantly reminding that
> > we *need* this kind of information and those heuristics are not
> > feasible in the long run anyway.
> > 
> > I believe that this conversation happened several times already, but
> > what about having independent public database of all the internal
> > information about hw from different vendors where users can add
> > information gathered by the time attack heuristic so other does not
> > have to run this again and again. I am not sure if Linaro or someone
> > else have something like that, someone can maybe post a link to that.
> > 
> 
> As I mentioned, I agree to push vendors to open those information all the time.
> And, I absolutely didn't mean that it is worth to wait vendors.
> I meant, until opening those information by vendors, something like
> proposing f2fs or gathering heuristics are also needed simultaneously.
> 
> Anyway, it's very interesting to build a database gathering products' information.
> May I access the database?

That's what I found:

https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey

-Lukas

> 
> Thanks,
> 
> > Eventually we can show this to the vendors to see that their
> > "secrets" are already public anyway and that everyones lives would be
> > easier if they just agree to provide it from the beginning.
> > 
> > >
> > > > Promoting time attack heuristics instead of pushing vendors to tell
> > > > us how their hardware should be used is a journey to hell and we've
> > > > been talking about this for a looong time now. And I imagine that
> > > > you especially have quite some persuasion power.
> > >
> > > I know. :)
> > > If there comes a chance, I want to try.
> > > Thanks,
> > 
> > That's very good to hear, thank you.
> > 
> > -Lukas

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 12:39                         ` Lukáš Czerner
@ 2012-10-09 13:10                           ` Jaegeuk Kim
  0 siblings, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-09 13:10 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Jaegeuk Kim, 'Namjae Jeon', 'Vyacheslav Dubeyko',
	'Marco Stornelli', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

2012-10-09 (화), 14:39 +0200, Lukáš Czerner:
> On Tue, 9 Oct 2012, Jaegeuk Kim wrote:
> 
> > > > > > > >
> > > > > > > > As you can see the f2fs kernel document patch, I think one of the most
> > > > > > > > important features is to align operating units between f2fs and ftl.
> > > > > > > > Specifically, f2fs has section and zone, which are cleaning unit and basic
> > > > > > > > allocation unit respectively.
> > > > > > > > Through these configurable units in f2fs, I think f2fs is able to reduce the
> > > > > > > > unnecessary operations done by FTL.
> > > > > > > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges
> > > > > > > > itself some bios likewise ext4.
> > > > > > > Hello.
> > > > > > > The internal of eMMC and SSD is the blackbox from user side.
> > > > > > > How does the normal user easily set operating units alignment(page
> > > > > > > size and physical block size ?) between f2fs and ftl in storage device
> > > > > > > ?
> > > > > >
> > > > > > I've known that some works have been tried to figure out the units by profiling the storage, AKA
> > > > > reverse engineering.
> > > > > > In most cases, the simplest way is to measure the latencies of consecutive writes and analyze
> > > their
> > > > > patterns.
> > > > > > As you mentioned, in practical, users will not want to do this, so maybe we need a tool to
> > > profile
> > > > > them to optimize f2fs.
> > > > > > In the current state, I think profiling is an another issue, and mkfs.f2fs had better include
> > > this
> > > > > work in the future.
> > > > > > But, IMO, from the viewpoint of performance, default configuration is quite enough now.
> > > > > >
> > > > > > ps) f2fs doesn't care about the flash page size, but considers garbage collection unit.
> > > > >
> > > > > I am sorry but this reply makes me smile. How can you design a fs
> > > > > relying on time attack heuristics to figure out what the proper
> > > > > layout should be ? Or even endorse such heuristics to be used in
> > > > > mkfs ? What we should be focusing on is to push vendors to actually
> > > > > give us such information so we can properly propagate that
> > > > > throughout the kernel - that's something everyone will benefit from.
> > > > > After that the optimization can be done in every file system.
> > > > >
> > > >
> > > > Frankly speaking, I agree that it would be the right direction eventually.
> > > > But, as you know, it's very difficult for all flash vendors to promote and standardize that.
> > > > Because each vendors have different strategies to open their internal information and also try
> > > > to protect their secrets whatever they are.
> > > >
> > > > IMO, we don't need to wait them now.
> > > > Instead, from the start, I suggest f2fs that uses those information to the file system design.
> > > > In addition, I suggest using heuristics right now as best efforts.
> > > > Maybe in future, if vendors give something, f2fs would be more feasible.
> > > > In the mean time, I strongly hope to validate and stabilize f2fs with community.
> > > 
> > > Do not get me wrong, I do not think it is worth to wait for vendors
> > > to come to their senses, but it is worth constantly reminding that
> > > we *need* this kind of information and those heuristics are not
> > > feasible in the long run anyway.
> > > 
> > > I believe that this conversation happened several times already, but
> > > what about having independent public database of all the internal
> > > information about hw from different vendors where users can add
> > > information gathered by the time attack heuristic so other does not
> > > have to run this again and again. I am not sure if Linaro or someone
> > > else have something like that, someone can maybe post a link to that.
> > > 
> > 
> > As I mentioned, I agree to push vendors to open those information all the time.
> > And, I absolutely didn't mean that it is worth to wait vendors.
> > I meant, until opening those information by vendors, something like
> > proposing f2fs or gathering heuristics are also needed simultaneously.
> > 
> > Anyway, it's very interesting to build a database gathering products' information.
> > May I access the database?
> 
> That's what I found:
> 
> https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey
> 

That is very useful information for users configuring f2fs according to
their storage.
Thank you.

-Jaegeuk Kim

> -Lukas
> 
> > 
> > Thanks,
> > 
> > > Eventually we can show this to the vendors to see that their
> > > "secrets" are already public anyway and that everyones lives would be
> > > easier if they just agree to provide it from the beginning.
> > > 
> > > >
> > > > > Promoting time attack heuristics instead of pushing vendors to tell
> > > > > us how their hardware should be used is a journey to hell and we've
> > > > > been talking about this for a looong time now. And I imagine that
> > > > > you especially have quite some persuasion power.
> > > >
> > > > I know. :)
> > > > If there comes a chance, I want to try.
> > > > Thanks,
> > > 
> > > That's very good to hear, thank you.
> > > 
> > > -Lukas

-- 
Jaegeuk Kim
Samsung


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09  7:08                 ` Jaegeuk Kim
@ 2012-10-09 19:53                   ` Jooyoung Hwang
  -1 siblings, 0 replies; 154+ messages in thread
From: Jooyoung Hwang @ 2012-10-09 19:53 UTC (permalink / raw)
  To: 'Vyacheslav Dubeyko'
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, linux-fsdevel

On Tue, 2012-10-09 at 16:08 +0900, Jaegeuk Kim wrote:
> > -----Original Message-----
> > From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > Sent: Tuesday, October 09, 2012 4:23 AM
> > To: Jaegeuk Kim
> > Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> > kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> > linux-fsdevel@vger.kernel.org
> > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > 
> > Hi,
> > 
> > On Oct 8, 2012, at 12:25 PM, Jaegeuk Kim wrote:
> > 
> > >> -----Original Message-----
> > >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > >> Sent: Sunday, October 07, 2012 9:09 PM
> > >> To: Jaegeuk Kim
> > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> > >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> > >> linux-fsdevel@vger.kernel.org
> > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >>
> > >> Hi,
> > >>
> > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > >>
> > >>>> -----Original Message-----
> > >>>> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> > >>>> Sent: Sunday, October 07, 2012 4:10 PM
> > >>>> To: Jaegeuk Kim
> > >>>> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tytso@mit.edu;
> > gregkh@linuxfoundation.org;
> > >>>> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> > >> jooyoung.hwang@samsung.com;
> > >>>> linux-fsdevel@vger.kernel.org
> > >>>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >>>>
> > >>>> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > >>>>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> > >>>>>> Hi Jaegeuk,
> > >>>>>
> > >>>>> Hi.
> > >>>>> We know each other, right? :)
> > >>>>>
> > >>>>>>
> > >>>>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> > >>>>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> > >>>> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com,
> > >> cm224.lee@samsung.com,
> > >>>> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> > >>>>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> > >>>>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> > >>>>>>>
> > >>>>>>> This is a new patch set for the f2fs file system.
> > >>>>>>>
> > >>>>>>> What is F2FS?
> > >>>>>>> =============
> > >>>>>>>
> > >>>>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> > >>>>>>> been widely being used for ranging from mobile to server systems. Since they are
> > >>>>>>> known to have different characteristics from the conventional rotational disks,
> > >>>>>>> a file system, an upper layer to the storage device, should adapt to the changes
> > >>>>>>> from the sketch.
> > >>>>>>>
> > >>>>>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
> > >>>>>>> devices. We chose a log structure file system approach, but we tried to adapt it
> > >>>>>>> to the new form of storage. Also we remedy some known issues of the very old log
> > >>>>>>> structured file system, such as snowball effect of wandering tree and high cleaning
> > >>>>>>> overhead.
> > >>>>>>>
> > >>>>>>> Because a NAND-based storage device shows different characteristics according to
> > >>>>>>> its internal geometry or flash memory management scheme aka FTL, we add various
> > >>>>>>> parameters not only for configuring on-disk layout, but also for selecting allocation
> > >>>>>>> and cleaning algorithms.
> > >>>>>>>
> > >>>>>>
> > >>>>>> What about F2FS performance? Could you share benchmarking results of the new file system?
> > >>>>>>
> > >>>>>> It is very interesting the case of aged file system. How is GC's implementation efficient?
> > Could
> > >>>> you share benchmarking results for the very aged file system state?
> > >>>>>>
> > >>>>>
> > >>>>> Although I have benchmark results, currently I'd like to see the results
> > >>>>> measured by community as a black-box. As you know, the results are very
> > >>>>> dependent on the workloads and parameters, so I think it would be better
> > >>>>> to see other results for a while.
> > >>>>> Thanks,
> > >>>>>
> > >>>>
> > >>>> 1) Actually it's a strange approach. If you have got any results you
> > >>>> should share them with the community explaining how (the workload, hw
> > >>>> and so on) your benchmark works and the specific condition. I really
> > >>>> don't like the approach "I've got the results but I don't say anything,
> > >>>> if you want a number, do it yourself".
> > >>>
> > >>> It's definitely right, and I meant *for a while*.
> > >>> I just wanted to avoid arguing with how to age file system in this time.
> > >>> Before then, I share the primitive results as follows.
> > >>>
> > >>> 1. iozone in Panda board
> > >>> - ARM A9
> > >>> - DRAM : 1GB
> > >>> - Kernel: Linux 3.3
> > >>> - Partition: 12GB (64GB Samsung eMMC)
> > >>> - Tested on 2GB file
> > >>>
> > >>>          seq. read, seq. write, rand. read, rand. write
> > >>> - ext4:    30.753         17.066       5.06         4.15
> > >>> - f2fs:    30.71          16.906       5.073       15.204
> > >>>
> > >>> 2. iozone in Galaxy Nexus
> > >>> - DRAM : 1GB
> > >>> - Android 4.0.4_r1.2
> > >>> - Kernel omap 3.0.8
> > >>> - Partition: /data, 12GB
> > >>> - Tested on 2GB file
> > >>>
> > >>>          seq. read, seq. write, rand. read,  rand. write
> > >>> - ext4:    29.88        12.83         11.43          0.56
> > >>> - f2fs:    29.70        13.34         10.79         12.82
> > >>>
> > >>
> > >>
> > >> This is results for non-aged filesystem state. Am I correct?
> > >>
> > >
> > > Yes, right.
> > >
> > >>
> > >>> Due to the company secret, I expect to show other results after presenting f2fs at korea linux
> > forum.
> > >>>
> > >>>> 2) For a new filesystem you should send the patches to linux-fsdevel.
> > >>>
> > >>> Yes, that was totally my mistake.
> > >>>
> > >>>> 3) It's not clear the pros/cons of your filesystem, can you share with
> > >>>> us the main differences with the current fs already in mainline? Or is
> > >>>> it a company secret?
> > >>>
> > >>> After forum, I can share the slides, and I hope they will be useful to you.
> > >>>
> > >>> Instead, let me summarize at a glance compared with other file systems.
> > >>> Here are several log-structured file systems.
> > >>> Note that, F2FS operates on top of block device with consideration on the FTL behavior.
> > >>> So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed for raw NAND flash.
> > >>> LogFS is initially designed for raw NAND flash, but expanded to block device.
> > >>> But, I don't know whether it is stable or not.
> > >>> NILFS2 is one of major log-structured file systems, which supports multiple snap-shots.
> > >>> IMO, that feature is quite promising and important to users, but it may degrade the performance.
> > >>> There is a trade-off between functionalities and performance.
> > >>> F2FS chose high performance without any further fancy functionalities.
> > >>>
> > >>
> > >> Performance is a good goal. But fault-tolerance is also very important point. Filesystems are used
> > by
> > >> users, so, it is very important to guarantee reliability of data keeping. Degradation of
> > performance
> > >> by means of snapshots is arguable point. Snapshots can solve the problem not only some
> > unpredictable
> > >> environmental issues but also user's erroneous behavior.
> > >>
> > >
> > > Yes, I agree. I concerned the multiple snapshot feature.
> > > Of course, fault-tolerance is very important, and file system should support it as you know as
> > power-off-recovery.
> > > f2fs supports the recovery mechanism by adopting checkpoint similar to snapshot.
> > > But, f2fs does not support multiple snapshots for user convenience.
> > > I just focused on the performance, and absolutely, the multiple snapshot feature is also a good
> > alternative approach.
> > > That may be a trade-off.
> > 
> > So, maybe I misunderstand something, but I can't understand the difference. As I know, snapshot in
> > NILFS2 is a checkpoint converted by user in snapshot. So, NILFS2's checkpoint is a log that adds new
> > file system's state changing (user data + metadata). In other words, checkpoint is mechanism of
> > writing on volume. Moreover, NILFS2 gives flexible way of checkpoint/snapshot management.
> > 
> > As you are saying, f2fs supports checkpoints also. It means for me that checkpoints are the basic
> > mechanism of writing operations on f2fs. But, about what performance gain and difference do you talk?
> 
> How about the following scenario?
> 1. data "a" is newly written.
> 2. checkpoint "A" is done.
> 3. data "a" is truncated.
> 4. checkpoint "B" is done.
> 
> If the fs exposes multiple snapshots like "A" and "B" to users, it cannot reuse the space allocated by
> data "a" after checkpoint "B", even though data "a" was safely truncated at checkpoint "B".
> This is because the fs must keep data "a" in case of a roll-back to "A".
> So, even though the user sees some free space, an LFS may suffer from cleaning due to exhausted free space.
> If users want to avoid this, they have to remove snapshots themselves. Or maybe automatically?
> 
> > 
> > Moreover, user can't manage by f2fs checkpoints completely, as I can understand. It is not so clear
> > what critical points can be a starting points of recovery actions. How is it possible to define how
> > many checkpoints f2fs volume will have?
> 
> IMHO, user does not need to know how many snapshots there exist and track the fs utilization all the time.
> (off list: I don't know why cleaning process should be tuned by users.)
> 
> f2fs writes two checkpoints alternately: one holds the last stable checkpoint and the other is used for the next checkpoint.
> So, during recovery, f2fs starts by finding the latest stable checkpoint.
> A stable checkpoint must contain the whole index structures and data consistently.
> As you know, many of the details can be found in the following LFS paper.
> http://www.cs.berkeley.edu/~brewer/cs262/LFS.pdf
> 
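To make the alternating-checkpoint idea concrete, here is a simplified sketch (hypothetical structures and names, not the actual f2fs on-disk format or mount code):

#include <stddef.h>

/* Two checkpoint locations are written in turn; at mount time the newest
 * one that was written out completely is chosen by its version number. */
struct ckpt {
	unsigned long long version;	/* bumped on every checkpoint write */
	int valid;			/* header, footer and checksum agree */
};

static struct ckpt *pick_stable(struct ckpt *a, struct ckpt *b)
{
	if (a->valid && b->valid)
		return a->version > b->version ? a : b;
	if (a->valid)
		return a;
	if (b->valid)
		return b;
	return NULL;	/* no consistent checkpoint: cannot mount */
}

Because one of the two locations is always left untouched while the other is being written, a crash in the middle of a checkpoint never destroys the last stable one.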
> 
> > 
> > How many user data (metadata) can be lost in the case of sudden power off? Is it possible to estimate
> > this?
> > 
> 
> If the user calls sync, f2fs via VFS writes all the data and then writes a checkpoint.
> In that case, all the data are safe.
> After that sync, several fsyncs may be issued before a sudden power-off occurs.
> In that case, f2fs first rolls back to the last stable checkpoint of the two, and then rolls forward to recover only the fsync'ed data.
> So, f2fs recovers only the data covered by sync or fsync.
> 
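From an application's point of view, the guarantee described above amounts to roughly the following (an illustrative sketch; save_record() is a made-up helper, not an f2fs interface):

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Append a record durably: after fsync() returns, the record survives a
 * sudden power cut because f2fs rolls back to the last checkpoint and then
 * rolls forward over the fsync'ed blocks. Data written without fsync() or
 * sync() since the last checkpoint may be lost. */
int save_record(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd)) {
		close(fd);
		return -1;	/* not durable */
	}
	return close(fd);
}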
> > >
> > >> As I understand, it is not possible to have a perfect performance in all possible workloads. Could
> > you
> > >> point out what workloads are the best way of F2FS using?
> > >
> > > Basically I think the following workloads will be good for F2FS.
> > > - Many random writes : it's LFS nature
> > > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync overhead.
> > >
> > 
> > Yes, it can be so for the case of non-aged f2fs volume. But I am afraid that for the case of aged f2fs
> > volume the situation can be opposite. I think that in the case of aged state of f2fs volume the GC
> > will be under hard work in above-mentioned workloads.
> 
> Yes, you're right.
> In the LFS paper above, there are two logging schemes: threaded logging and copy-and-compaction.
> In order to avoid high cleaning overhead, f2fs adopts a hybrid scheme which switches the allocation policy dynamically
> between the two.
> Threaded logging is similar to the traditional approach, resulting in random writes without cleaning operations.
> Copy-and-compaction is another name for cleaning, resulting in sequential writes with cleaning operations.
> So, f2fs picks one of them at runtime according to the file system status.
> Through this, we see random write performance comparable to ext4 even in the worst case.
> 
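A toy model of such a dynamic policy switch might look like this (the names and the 5% threshold are invented for illustration and are not taken from the f2fs patches):

/* Toy model: choose the logging scheme from the amount of free space. */
enum alloc_policy { ALLOC_LFS, ALLOC_THREADED };

static enum alloc_policy pick_policy(unsigned int free_sections,
				     unsigned int total_sections)
{
	/* Plenty of clean space: append-only logging, clean in background. */
	if (free_sections * 100 / total_sections > 5)
		return ALLOC_LFS;
	/* Nearly full: reuse invalid blocks inside dirty segments (threaded
	 * logging) instead of paying the copy-and-compaction cost up front. */
	return ALLOC_THREADED;
}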
> > 
> > But, as I can understand, smartphones and tablets are the most promising way of f2fs using. Because
> > f2fs designs for NAND flash memory based-storage devices. So, I think that such workloads as "many
> > random writes" or "small writes with frequent fsync" are not so frequent use-cases. Use-case of
> > creation and deletion many small files can be more frequent use-case under smartphones and tablets.
> > But, as I can understand, f2fs has slightly expensive metadata payload in the case of small files
> > creation. Moreover, frequent and random deletion of small files ends in the very sophisticated and
> > unpredictable GC behavior, as I can understand.
> > 
> 
> I'd like to share the following paper.
> http://research.cs.wisc.edu/adsl/Publications/ibench-tocs12.pdf
> 
> In our experiments on Android phones we have *also* seen many random write patterns with frequent fsync calls.
> We found that the main source is the database, and I think f2fs is beneficial here.
> As you mentioned, I agree that it is important to handle many small files too.
> It is right that this may cause additional cleaning overhead, and f2fs has some metadata payload overhead.
> In order to reduce the cleaning overhead, f2fs adopts static and dynamic hot and cold data separation.
> The main goal is to split the data according to their type (e.g., dir inode, file inode, dentry data, etc.) as much as possible.
> Please see the document for details.
> I think this approach is quite effective in achieving that goal.
> BTW, the payload overhead can be resolved by embedding data in the inode, as ext4 does.
> I think that is also a good idea, and I hope to adopt it in the future.
> 
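Conceptually, the static separation boils down to a mapping along these lines (illustrative only; the actual classification in the patches is richer and is combined with dynamic separation):

/* Illustrative static hot/cold classification, not the actual f2fs policy. */
enum log_type { LOG_HOT_NODE, LOG_WARM_NODE, LOG_COLD_NODE,
		LOG_HOT_DATA, LOG_WARM_DATA, LOG_COLD_DATA };

enum block_kind { DIR_NODE, FILE_NODE, INDIRECT_NODE,
		  DENTRY_DATA, FILE_DATA, MULTIMEDIA_DATA };

static enum log_type classify(enum block_kind kind)
{
	switch (kind) {
	case DIR_NODE:		return LOG_HOT_NODE;  /* directories change often */
	case FILE_NODE:		return LOG_WARM_NODE;
	case INDIRECT_NODE:	return LOG_COLD_NODE; /* rarely rewritten */
	case DENTRY_DATA:	return LOG_HOT_DATA;
	case MULTIMEDIA_DATA:	return LOG_COLD_DATA; /* write-once, large */
	default:		return LOG_WARM_DATA;
	}
}

Writing blocks with a similar update frequency into the same log keeps whole segments either mostly valid or mostly invalid, which is what keeps the cleaning cost down.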

I'd like you to refer to the following link as well, which is about
mobile workload patterns:
http://www.cs.cmu.edu/~fuyaoz/courses/15712/report.pdf
It reports that in Android fsync is issued frequently and that most
fsyncs cover only small amounts of data.

To provide efficient fsync, F2FS minimizes the amount of metadata
written to serve an fsync. An fsync in F2FS is completed by writing the
user data blocks and the direct node blocks which point to them, rather
than creating a new checkpoint, which would incur more I/O load.
If a sudden power failure happens, the F2FS recovery routine rolls back
to the latest checkpoint and thereafter recovers the file system state to
reflect all completed fsync operations, which we call roll-forward
recovery.
You may want to look at the roll-forward code in recover_fsync_data().
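In outline, that roll-forward step might look like the following (a simplified sketch, not the actual recover_fsync_data() implementation):

/* Simplified sketch of roll-forward recovery after a crash. */
struct node_blk {
	int written_by_fsync;	/* marked when the block was written for fsync */
	int ino;		/* owning inode */
	/* ... block pointers to the fsync'ed data ... */
};

static void roll_forward(struct node_blk *node_log, int nblocks)
{
	int i;

	for (i = 0; i < nblocks; i++) {
		/* Only direct node blocks written by fsync follow the last
		 * checkpoint; anything else ends the chain. */
		if (!node_log[i].written_by_fsync)
			break;
		/* Re-attach the data blocks referenced by this node block to
		 * inode node_log[i].ino, so the fsync'ed data becomes visible
		 * again after the roll-back to the last checkpoint. */
	}
}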

> > >>
> > >>> Maybe or obviously it is possible to optimize ext4 or btrfs to flash storages.
> > >>> IMHO, however, they are originally designed for HDDs, so that it may or may not suffer from
> > >> fundamental designs.
> > >>> I don't know, but why not designing a new file system for flash storages as a counterpart?
> > >>>
> > >>
> > >> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2, YAFFS2, UBIFS but block-
> > >> oriented filesystem. So, F2FS design is restricted by block-layer's opportunities in the using of
> > >> flash storages' peculiarities. Could you point out key points of F2FS design that makes this design
> > >> fundamentally unique?
> > >
> > > As you can see the f2fs kernel document patch, I think one of the most important features is to
> > align operating units between f2fs and ftl.
> > > Specifically, f2fs has section and zone, which are cleaning unit and basic allocation unit
> > respectively.
> > > Through these configurable units in f2fs, I think f2fs is able to reduce the unnecessary operations
> > done by FTL.
> > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges itself some bios
> > likewise ext4.
> > >
> > 
> > As I can understand, it is not so easy to create partition with f2fs volume which is aligned on
> > operating units (especially in the case of eMMC or SSD).
> 
> Could you explain why it is not so easy?
> 
> > Performance of unaligned volume can degrade
> > significantly because of FTL activity. What mechanisms has f2fs for excluding such situation and
> > achieving of the goal to reduce unnecessary FTL operations?
> 
> Could you please explain your concern more exactly?
> In the kernel doc, the start address of the f2fs data structures is aligned to the segment size (i.e., 2MB).
> Do you mean that, or other operating units (e.g., section and zone)?
> 
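As a quick illustration (a hypothetical helper, not part of the f2fs tools), a partition start reported in 512-byte sectors, e.g. from /sys/block/<dev>/<part>/start, can be checked against that 2MB segment alignment like this:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	unsigned long long start_sector, start_bytes;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <start-sector>\n", argv[0]);
		return 1;
	}
	start_sector = strtoull(argv[1], NULL, 0);
	start_bytes = start_sector * 512ULL;
	printf("start = %llu bytes: %s\n", start_bytes,
	       start_bytes % (2ULL << 20) ? "NOT 2MB aligned" : "2MB aligned");
	return 0;
}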
> Thanks,
> 
> > 
> > With the best regards,
> > Vyacheslav Dubeyko.
> > 
> > >>
> > >> With the best regards,
> > >> Vyacheslav Dubeyko.
> > >>
> > >>
> > >>>>
> > >>>> Marco
> > >>>
> > >>> ---
> > >>> Jaegeuk Kim
> > >>> Samsung
> > >>>
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > >>> the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >>> Please read the FAQ at  http://www.tux.org/lkml/
> > >
> > >
> > > ---
> > > Jaegeuk Kim
> > > Samsung
> > >
> 
> 
> ---
> Jaegeuk Kim
> Samsung
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
Jooyoung Hwang
Samsung Electronics


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
@ 2012-10-09 19:53                   ` Jooyoung Hwang
  0 siblings, 0 replies; 154+ messages in thread
From: Jooyoung Hwang @ 2012-10-09 19:53 UTC (permalink / raw)
  To: 'Vyacheslav Dubeyko'
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, linux-fsdevel

On Tue, 2012-10-09 at 16:08 +0900, Jaegeuk Kim wrote:
> > -----Original Message-----
> > From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > Sent: Tuesday, October 09, 2012 4:23 AM
> > To: Jaegeuk Kim
> > Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> > kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> > linux-fsdevel@vger.kernel.org
> > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > 
> > Hi,
> > 
> > On Oct 8, 2012, at 12:25 PM, Jaegeuk Kim wrote:
> > 
> > >> -----Original Message-----
> > >> From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > >> Sent: Sunday, October 07, 2012 9:09 PM
> > >> To: Jaegeuk Kim
> > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> > >> kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> > >> linux-fsdevel@vger.kernel.org
> > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >>
> > >> Hi,
> > >>
> > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > >>
> > >>>> -----Original Message-----
> > >>>> From: Marco Stornelli [mailto:marco.stornelli@gmail.com]
> > >>>> Sent: Sunday, October 07, 2012 4:10 PM
> > >>>> To: Jaegeuk Kim
> > >>>> Cc: Vyacheslav Dubeyko; jaegeuk.kim@samsung.com; Al Viro; tytso@mit.edu;
> > gregkh@linuxfoundation.org;
> > >>>> linux-kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com;
> > >> jooyoung.hwang@samsung.com;
> > >>>> linux-fsdevel@vger.kernel.org
> > >>>> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >>>>
> > >>>> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > >>>>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> > >>>>>> Hi Jaegeuk,
> > >>>>>
> > >>>>> Hi.
> > >>>>> We know each other, right? :)
> > >>>>>
> > >>>>>>
> > >>>>>>> From:	 	김재극 <jaegeuk.kim@samsung.com>
> > >>>>>>> To:	 	viro@zeniv.linux.org.uk, 'Theodore Ts'o' <tytso@mit.edu>,
> > >>>> gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, chur.lee@samsung.com,
> > >> cm224.lee@samsung.com,
> > >>>> jaegeuk.kim@samsung.com, jooyoung.hwang@samsung.com
> > >>>>>>> Subject:	 	[PATCH 00/16] f2fs: introduce flash-friendly file system
> > >>>>>>> Date:	 	Fri, 05 Oct 2012 20:55:07 +0900
> > >>>>>>>
> > >>>>>>> This is a new patch set for the f2fs file system.
> > >>>>>>>
> > >>>>>>> What is F2FS?
> > >>>>>>> =============
> > >>>>>>>
> > >>>>>>> NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> > >>>>>>> been widely being used for ranging from mobile to server systems. Since they are
> > >>>>>>> known to have different characteristics from the conventional rotational disks,
> > >>>>>>> a file system, an upper layer to the storage device, should adapt to the changes
> > >>>>>>> from the sketch.
> > >>>>>>>
> > >>>>>>> F2FS is a new file system carefully designed for the NAND flash memory-based storage
> > >>>>>>> devices. We chose a log structure file system approach, but we tried to adapt it
> > >>>>>>> to the new form of storage. Also we remedy some known issues of the very old log
> > >>>>>>> structured file system, such as snowball effect of wandering tree and high cleaning
> > >>>>>>> overhead.
> > >>>>>>>
> > >>>>>>> Because a NAND-based storage device shows different characteristics according to
> > >>>>>>> its internal geometry or flash memory management scheme aka FTL, we add various
> > >>>>>>> parameters not only for configuring on-disk layout, but also for selecting allocation
> > >>>>>>> and cleaning algorithms.
> > >>>>>>>
> > >>>>>>
> > >>>>>> What about F2FS performance? Could you share benchmarking results of the new file system?
> > >>>>>>
> > >>>>>> It is very interesting the case of aged file system. How is GC's implementation efficient?
> > Could
> > >>>> you share benchmarking results for the very aged file system state?
> > >>>>>>
> > >>>>>
> > >>>>> Although I have benchmark results, currently I'd like to see the results
> > >>>>> measured by community as a black-box. As you know, the results are very
> > >>>>> dependent on the workloads and parameters, so I think it would be better
> > >>>>> to see other results for a while.
> > >>>>> Thanks,
> > >>>>>
> > >>>>
> > >>>> 1) Actually it's a strange approach. If you have got any results you
> > >>>> should share them with the community explaining how (the workload, hw
> > >>>> and so on) your benchmark works and the specific condition. I really
> > >>>> don't like the approach "I've got the results but I don't say anything,
> > >>>> if you want a number, do it yourself".
> > >>>
> > >>> It's definitely right, and I meant *for a while*.
> > >>> I just wanted to avoid arguing with how to age file system in this time.
> > >>> Before then, I share the primitive results as follows.
> > >>>
> > >>> 1. iozone in Panda board
> > >>> - ARM A9
> > >>> - DRAM : 1GB
> > >>> - Kernel: Linux 3.3
> > >>> - Partition: 12GB (64GB Samsung eMMC)
> > >>> - Tested on 2GB file
> > >>>
> > >>>          seq. read, seq. write, rand. read, rand. write
> > >>> - ext4:    30.753         17.066       5.06         4.15
> > >>> - f2fs:    30.71          16.906       5.073       15.204
> > >>>
> > >>> 2. iozone in Galaxy Nexus
> > >>> - DRAM : 1GB
> > >>> - Android 4.0.4_r1.2
> > >>> - Kernel omap 3.0.8
> > >>> - Partition: /data, 12GB
> > >>> - Tested on 2GB file
> > >>>
> > >>>          seq. read, seq. write, rand. read,  rand. write
> > >>> - ext4:    29.88        12.83         11.43          0.56
> > >>> - f2fs:    29.70        13.34         10.79         12.82
> > >>>
> > >>
> > >>
> > >> This is results for non-aged filesystem state. Am I correct?
> > >>
> > >
> > > Yes, right.
> > >
> > >>
> > >>> Due to the company secret, I expect to show other results after presenting f2fs at korea linux
> > forum.
> > >>>
> > >>>> 2) For a new filesystem you should send the patches to linux-fsdevel.
> > >>>
> > >>> Yes, that was totally my mistake.
> > >>>
> > >>>> 3) It's not clear the pros/cons of your filesystem, can you share with
> > >>>> us the main differences with the current fs already in mainline? Or is
> > >>>> it a company secret?
> > >>>
> > >>> After forum, I can share the slides, and I hope they will be useful to you.
> > >>>
> > >>> Instead, let me summarize at a glance compared with other file systems.
> > >>> Here are several log-structured file systems.
> > >>> Note that, F2FS operates on top of block device with consideration on the FTL behavior.
> > >>> So, JFFS2, YAFFS2, and UBIFS are out-of scope, since they are designed for raw NAND flash.
> > >>> LogFS is initially designed for raw NAND flash, but expanded to block device.
> > >>> But, I don't know whether it is stable or not.
> > >>> NILFS2 is one of major log-structured file systems, which supports multiple snap-shots.
> > >>> IMO, that feature is quite promising and important to users, but it may degrade the performance.
> > >>> There is a trade-off between functionalities and performance.
> > >>> F2FS chose high performance without any further fancy functionalities.
> > >>>
> > >>
> > >> Performance is a good goal. But fault-tolerance is also very important point. Filesystems are used
> > by
> > >> users, so, it is very important to guarantee reliability of data keeping. Degradation of
> > performance
> > >> by means of snapshots is arguable point. Snapshots can solve the problem not only some
> > unpredictable
> > >> environmental issues but also user's erroneous behavior.
> > >>
> > >
> > > Yes, I agree. I concerned the multiple snapshot feature.
> > > Of course, fault-tolerance is very important, and file system should support it as you know as
> > power-off-recovery.
> > > f2fs supports the recovery mechanism by adopting checkpoint similar to snapshot.
> > > But, f2fs does not support multiple snapshots for user convenience.
> > > I just focused on the performance, and absolutely, the multiple snapshot feature is also a good
> > alternative approach.
> > > That may be a trade-off.
> > 
> > So, maybe I misunderstand something, but I can't understand the difference. As I know, snapshot in
> > NILFS2 is a checkpoint converted by user in snapshot. So, NILFS2's checkpoint is a log that adds new
> > file system's state changing (user data + metadata). In other words, checkpoint is mechanism of
> > writing on volume. Moreover, NILFS2 gives flexible way of checkpoint/snapshot management.
> > 
> > As you are saying, f2fs supports checkpoints also. It means for me that checkpoints are the basic
> > mechanism of writing operations on f2fs. But, about what performance gain and difference do you talk?
> 
> How about the following scenario?
> 1. data "a" is newly written.
> 2. checkpoint "A" is done.
> 3. data "a" is truncated.
> 4. checkpoint "B" is done.
> 
> If the fs exposes multiple snapshots like "A" and "B" to users, it cannot reuse the space allocated by
> data "a" after checkpoint "B", even though data "a" was safely truncated at checkpoint "B".
> This is because the fs must keep data "a" in case of a roll-back to "A".
> So, even though the user sees some free space, an LFS may suffer from cleaning due to exhausted free space.
> If users want to avoid this, they have to remove snapshots themselves. Or maybe automatically?
> 
> > 
> > Moreover, user can't manage by f2fs checkpoints completely, as I can understand. It is not so clear
> > what critical points can be a starting points of recovery actions. How is it possible to define how
> > many checkpoints f2fs volume will have?
> 
> IMHO, user does not need to know how many snapshots there exist and track the fs utilization all the time.
> (off list: I don't know why cleaning process should be tuned by users.)
> 
> f2fs writes two checkpoints alternately: one holds the last stable checkpoint and the other is used for the next checkpoint.
> So, during recovery, f2fs starts by finding the latest stable checkpoint.
> A stable checkpoint must contain the whole index structures and data consistently.
> As you know, many of the details can be found in the following LFS paper.
> http://www.cs.berkeley.edu/~brewer/cs262/LFS.pdf
> 
> 
> > 
> > How many user data (metadata) can be lost in the case of sudden power off? Is it possible to estimate
> > this?
> > 
> 
> If the user calls sync, f2fs via VFS writes all the data and then writes a checkpoint.
> In that case, all the data are safe.
> After that sync, several fsyncs may be issued before a sudden power-off occurs.
> In that case, f2fs first rolls back to the last stable checkpoint of the two, and then rolls forward to recover only the fsync'ed data.
> So, f2fs recovers only the data covered by sync or fsync.
> 
> > >
> > >> As I understand, it is not possible to have a perfect performance in all possible workloads. Could
> > you
> > >> point out what workloads are the best way of F2FS using?
> > >
> > > Basically I think the following workloads will be good for F2FS.
> > > - Many random writes : it's LFS nature
> > > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync overhead.
> > >
> > 
> > Yes, it can be so for the case of non-aged f2fs volume. But I am afraid that for the case of aged f2fs
> > volume the situation can be opposite. I think that in the case of aged state of f2fs volume the GC
> > will be under hard work in above-mentioned workloads.
> 
> Yes, you're right.
> In the LFS paper above, there are two logging schemes: threaded logging and copy-and-compaction.
> In order to avoid high cleaning overhead, f2fs adopts a hybrid scheme which switches the allocation policy dynamically
> between the two.
> Threaded logging is similar to the traditional approach, resulting in random writes without cleaning operations.
> Copy-and-compaction is another name for cleaning, resulting in sequential writes with cleaning operations.
> So, f2fs picks one of them at runtime according to the file system status.
> Through this, we see random write performance comparable to ext4 even in the worst case.
> 
> > 
> > But, as I can understand, smartphones and tablets are the most promising way of f2fs using. Because
> > f2fs designs for NAND flash memory based-storage devices. So, I think that such workloads as "many
> > random writes" or "small writes with frequent fsync" are not so frequent use-cases. Use-case of
> > creation and deletion many small files can be more frequent use-case under smartphones and tablets.
> > But, as I can understand, f2fs has slightly expensive metadata payload in the case of small files
> > creation. Moreover, frequent and random deletion of small files ends in the very sophisticated and
> > unpredictable GC behavior, as I can understand.
> > 
> 
> I'd like to share the following paper.
> http://research.cs.wisc.edu/adsl/Publications/ibench-tocs12.pdf
> 
> In our experiments *also* on Android phones, we've seen many random write patterns with frequent fsync calls.
> We found that the main problem is the database, and I think f2fs is beneficial here.
> As you mentioned, I agree that it is important to handle many small files too.
> It is true that this may cause additional cleaning overhead, and f2fs has some metadata payload overhead.
> In order to reduce the cleaning overhead, f2fs adopts static and dynamic hot/cold data separation.
> The main goal is to split the data according to their type (e.g., dir inode, file inode, dentry data, etc.) as much as possible.
> Please see the document for details.
> I think this approach is quite effective in achieving that goal.
> BTW, the payload overhead can be reduced by embedding data in the inode, as ext4 does.
> I think that is also a good idea, and I hope to adopt it in the future.
> 

I'd like to refer you to the following link as well, which is about
mobile workload patterns.
http://www.cs.cmu.edu/~fuyaoz/courses/15712/report.pdf
It's reported that in Android fsync is issued frequently, and most
fsyncs cover only a small amount of data.

To provide efficient fsync, F2FS minimizes the amount of metadata
written to serve an fsync. An fsync in F2FS is completed by writing the
user data blocks and the direct node blocks which point to them, rather
than creating a new checkpoint, which would incur more I/O load.
If a sudden power failure happens, the F2FS recovery routine rolls back
to the latest checkpoint and thereafter recovers the file system state
to reflect all the completed fsync operations, which we call
roll-forward recovery.
You may want to look at the roll-forward code in recover_fsync_data().
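
At a high level it behaves like the toy model below (all structures and
names are invented purely for illustration; they are not the on-disk
format or the actual recover_fsync_data() code):

/* Toy model: roll back to the newest valid checkpoint, then roll
 * forward over the records written by fsync after it. */
#include <stdint.h>
#include <stdio.h>

struct toy_checkpoint {
        uint64_t version;       /* bumped on every checkpoint    */
        uint32_t crc;           /* validity marker               */
};

struct toy_fsync_rec {
        uint64_t after_version; /* written after this checkpoint */
        uint32_t ino;           /* inode that was fsync'ed       */
};

static int cp_valid(const struct toy_checkpoint *cp)
{
        return cp->crc == 0xf2f5;       /* stand-in for a real CRC check */
}

int main(void)
{
        /* The two checkpoint packs written alternately. */
        struct toy_checkpoint packs[2] = {
                { .version = 7, .crc = 0xf2f5 },
                { .version = 8, .crc = 0x0bad },        /* torn by power cut */
        };
        /* Data + direct node blocks written by fsync after checkpointing. */
        struct toy_fsync_rec log[2] = {
                { .after_version = 7, .ino = 42 },
                { .after_version = 7, .ino = 99 },
        };
        const struct toy_checkpoint *stable = NULL;
        int i;

        /* Roll back: pick the newest checkpoint that is still valid. */
        for (i = 0; i < 2; i++)
                if (cp_valid(&packs[i]) &&
                    (!stable || packs[i].version > stable->version))
                        stable = &packs[i];
        if (!stable)
                return 1;
        printf("rolled back to checkpoint v%llu\n",
               (unsigned long long)stable->version);

        /* Roll forward: re-apply only the fsync'ed records after it. */
        for (i = 0; i < 2; i++)
                if (log[i].after_version >= stable->version)
                        printf("recovering fsync'ed inode %u\n",
                               (unsigned)log[i].ino);
        return 0;
}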

> > >>
> > >>> Maybe or obviously it is possible to optimize ext4 or btrfs to flash storages.
> > >>> IMHO, however, they are originally designed for HDDs, so that it may or may not suffer from
> > >> fundamental designs.
> > >>> I don't know, but why not designing a new file system for flash storages as a counterpart?
> > >>>
> > >>
> > >> Yes, it is possible. But F2FS is not flash oriented filesystem as JFFS2, YAFFS2, UBIFS but block-
> > >> oriented filesystem. So, F2FS design is restricted by block-layer's opportunities in the using of
> > >> flash storages' peculiarities. Could you point out key points of F2FS design that makes this design
> > >> fundamentally unique?
> > >
> > > As you can see the f2fs kernel document patch, I think one of the most important features is to
> > align operating units between f2fs and ftl.
> > > Specifically, f2fs has section and zone, which are cleaning unit and basic allocation unit
> > respectively.
> > > Through these configurable units in f2fs, I think f2fs is able to reduce the unnecessary operations
> > done by FTL.
> > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges itself some bios
> > likewise ext4.
> > >
> > 
> > As I can understand, it is not so easy to create partition with f2fs volume which is aligned on
> > operating units (especially in the case of eMMC or SSD).
> 
> Could you explain why it is not so easy?
> 
> > Performance of unaligned volume can degrade
> > significantly because of FTL activity. What mechanisms has f2fs for excluding such situation and
> > achieving of the goal to reduce unnecessary FTL operations?
> 
> Could you please explain your concern more exactly?
> In the kernel doc, the start address of f2fs data structure is aligned to the segment size (i.e., 2MB).
> Do you mean that or another operating units (e.g., section and zone)?
> 
> Thanks,
> 
> > 
> > With the best regards,
> > Vyacheslav Dubeyko.
> > 
> > >>
> > >> With the best regards,
> > >> Vyacheslav Dubeyko.
> > >>
> > >>
> > >>>>
> > >>>> Marco
> > >>>
> > >>> ---
> > >>> Jaegeuk Kim
> > >>> Samsung
> > >>>
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > >>> the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >>> Please read the FAQ at  http://www.tux.org/lkml/
> > >
> > >
> > > ---
> > > Jaegeuk Kim
> > > Samsung
> > >
> 
> 
> ---
> Jaegeuk Kim
> Samsung
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
Jooyoung Hwang
Samsung Electronics


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 12:01                       ` Jaegeuk Kim
@ 2012-10-09 21:20                           ` Dave Chinner
  2012-10-09 21:20                           ` Dave Chinner
  1 sibling, 0 replies; 154+ messages in thread
From: Dave Chinner @ 2012-10-09 21:20 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Lukáš Czerner', 'Namjae Jeon',
	'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

[ Folks, can you trim your responses down to just quote the part you
are responding to? Having to repeatedly scroll through 500 lines of
irrelevant text just to find the 5 lines that is being commented on
is exceedingly painful.  ]

On Tue, Oct 09, 2012 at 09:01:18PM +0900, Jaegeuk Kim wrote:
> > From: Lukáš Czerner [mailto:lczerner@redhat.com]
> > > > I am sorry but this reply makes me smile. How can you design a fs
> > > > relying on time attack heuristics to figure out what the proper
> > > > layout should be ? Or even endorse such heuristics to be used in
> > > > mkfs ? What we should be focusing on is to push vendors to actually
> > > > give us such information so we can properly propagate that
> > > > throughout the kernel - that's something everyone will benefit from.
> > > > After that the optimization can be done in every file system.
> > > >
> > >
> > > Frankly speaking, I agree that it would be the right direction eventually.
> > > But, as you know, it's very difficult for all flash vendors to promote and standardize that.
> > > Because each vendors have different strategies to open their internal information and also try
> > > to protect their secrets whatever they are.
> > >
> > > IMO, we don't need to wait them now.
> > > Instead, from the start, I suggest f2fs that uses those information to the file system design.
> > > In addition, I suggest using heuristics right now as best efforts.

And in response, other people are "suggesting" that this is the
wrong approach.

> > > Maybe in future, if vendors give something, f2fs would be more feasible.
> > > In the mean time, I strongly hope to validate and stabilize f2fs with community.
> > 
> > Do not get me wrong, I do not think it is worth to wait for vendors
> > to come to their senses, but it is worth constantly reminding that
> > we *need* this kind of information and those heuristics are not
> > feasible in the long run anyway.
> > 
> > I believe that this conversation happened several times already, but
> > what about having independent public database of all the internal
> > information about hw from different vendors where users can add
> > information gathered by the time attack heuristic so other does not
> > have to run this again and again. I am not sure if Linaro or someone
> > else have something like that, someone can maybe post a link to that.

Linaro already have one, which is another reason why using
heuristics is the wrong approach:

https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey?action=show&redirect=WorkingGroups%2FKernelConsolidation%2FProjects%2FFlashCardSurvey

> As I mentioned, I agree to push vendors to open those information all the time.
> And, I absolutely didn't mean that it is worth to wait vendors.
> I meant, until opening those information by vendors, something like
> proposing f2fs or gathering heuristics are also needed simultaneously.
> 
> Anyway, it's very interesting to build a database gathering products' information.
> May I access the database?

It's public information.

If you want to support different types of flash, then either add
your timing attack derived information on specific hardware to the
above table, or force vendors to update it themselves if they want
their flash memory supported by this filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 21:20                           ` Dave Chinner
  (?)
@ 2012-10-10  2:32                           ` Jaegeuk Kim
  -1 siblings, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-10  2:32 UTC (permalink / raw)
  To: 'Dave Chinner'
  Cc: 'Lukáš Czerner', 'Namjae Jeon',
	'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> -----Original Message-----
> From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of
> Dave Chinner
> Sent: Wednesday, October 10, 2012 6:20 AM
> To: Jaegeuk Kim
> Cc: 'Lukáš Czerner'; 'Namjae Jeon'; 'Vyacheslav Dubeyko'; 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro';
> tytso@mit.edu; gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org; chur.lee@samsung.com;
> cm224.lee@samsung.com; jooyoung.hwang@samsung.com; linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> [ Folks, can you trim your responses down to just quote the part you
> are responding to? Having to repeatedly scroll through 500 lines of
> irrelevant text just to find the 5 lines that is being commented on
> is exceedingly painful.  ]

Ok, I'll keep that in mind.
Thanks.

> 
> On Tue, Oct 09, 2012 at 09:01:18PM +0900, Jaegeuk Kim wrote:
> > > From: Lukáš Czerner [mailto:lczerner@redhat.com]
> > > > > I am sorry but this reply makes me smile. How can you design a fs
> > > > > relying on time attack heuristics to figure out what the proper
> > > > > layout should be ? Or even endorse such heuristics to be used in
> > > > > mkfs ? What we should be focusing on is to push vendors to actually
> > > > > give us such information so we can properly propagate that
> > > > > throughout the kernel - that's something everyone will benefit from.
> > > > > After that the optimization can be done in every file system.
> > > > >
> > > >
> > > > Frankly speaking, I agree that it would be the right direction eventually.
> > > > But, as you know, it's very difficult for all flash vendors to promote and standardize that.
> > > > Because each vendors have different strategies to open their internal information and also try
> > > > to protect their secrets whatever they are.
> > > >
> > > > IMO, we don't need to wait them now.
> > > > Instead, from the start, I suggest f2fs that uses those information to the file system design.
> > > > In addition, I suggest using heuristics right now as best efforts.
> 
> And in response, other people are "suggesting" that this is the
> wrong approach.

Ok, that makes sense.
I agree that the Linaro survey has progressed well, and no further heuristics are needed.

> 
> > > > Maybe in future, if vendors give something, f2fs would be more feasible.
> > > > In the mean time, I strongly hope to validate and stabilize f2fs with community.
> > >
> > > Do not get me wrong, I do not think it is worth to wait for vendors
> > > to come to their senses, but it is worth constantly reminding that
> > > we *need* this kind of information and those heuristics are not
> > > feasible in the long run anyway.
> > >
> > > I believe that this conversation happened several times already, but
> > > what about having independent public database of all the internal
> > > information about hw from different vendors where users can add
> > > information gathered by the time attack heuristic so other does not
> > > have to run this again and again. I am not sure if Linaro or someone
> > > else have something like that, someone can maybe post a link to that.
> 
> Linaro already have one, which is another reason why using
> heuristics is the wrong approach:
> 
> https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey?action=show&redirect=WorkingGrou
> ps%2FKernelConsolidation%2FProjects%2FFlashCardSurvey
> 
> > As I mentioned, I agree to push vendors to open those information all the time.
> > And, I absolutely didn't mean that it is worth to wait vendors.
> > I meant, until opening those information by vendors, something like
> > proposing f2fs or gathering heuristics are also needed simultaneously.
> >
> > Anyway, it's very interesting to build a database gathering products' information.
> > May I access the database?
> 
> It's public information.
> 
> If you want to support different types of flash, then either add
> your timing attack derived information on specific hardware to the
> above table, or force vendors to update it themselves if they want
> their flash memory supported by this filesystem.

Sounds good.
If I gather such data as well, I'll try to add it there.
Thank you.
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 11:01                     ` Lukáš Czerner
@ 2012-10-10  4:53                         ` Theodore Ts'o
  2012-10-10  4:53                         ` Theodore Ts'o
  1 sibling, 0 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-10  4:53 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Jaegeuk Kim, 'Namjae Jeon', 'Vyacheslav Dubeyko',
	'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

On Tue, Oct 09, 2012 at 01:01:24PM +0200, Lukáš Czerner wrote:
> Do not get me wrong, I do not think it is worth to wait for vendors
> to come to their senses, but it is worth constantly reminding that
> we *need* this kind of information and those heuristics are not
> feasible in the long run anyway.

A number of us have been telling flash vendors exactly this.  The
technical people do seem to understand.  It's management who seem to
be primarily clueless, even though this information can be extracted
by employing timing attacks on the media.  I've pointed this out
before, and the technical people agree that trying to keep this
information as a "trade secret" is pointless, stupid, and
counterproductive.  Trying to get the pointy-haired bosses to
understand may take quite a while.
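
(For concreteness, the kind of timing probe being talked about is
roughly the sketch below: time small direct synchronous writes across a
scratch device and look for latency spikes that recur with a fixed
period, which often line up with erase-block boundaries.  The device
path, span, and block size are arbitrary examples, and real probing
tools are far more careful about confounding effects.)

/* Time small O_DIRECT synchronous writes across a device and print
 * (offset, microseconds) pairs.  DESTRUCTIVE -- scratch devices only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLK  4096
#define SPAN (64LL * 1024 * 1024)       /* probe the first 64 MiB */

int main(int argc, char **argv)
{
        void *buf;
        long long off;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s /dev/<scratch-device>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_WRONLY | O_DIRECT | O_SYNC);
        if (fd < 0 || posix_memalign(&buf, BLK, BLK))
                return 1;
        memset(buf, 0, BLK);

        for (off = 0; off < SPAN; off += BLK) {
                struct timespec t0, t1;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (pwrite(fd, buf, BLK, off) != BLK)
                        return 1;
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("%lld %ld\n", off,
                       (t1.tv_sec - t0.tv_sec) * 1000000L +
                       (t1.tv_nsec - t0.tv_nsec) / 1000);
        }
        close(fd);
        return 0;
}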

That being said, in many cases, it doesn't really matter.  For
example, if a manufacturer has a production run of a million Android
mobile devices, (a) all of the eMMC devices will be the same (or at
least come from a handful of suppliers in the worst case), and (b) the
manufacturers *will* be able to get this information under NDA, and so
they can just feed it straight to the mkfs program.  There's no need
in many cases to have mkfs burn write cycles carrying out a timing
attack on which flash device that it is formatting.


My concern is a different one.  We shouldn't just be focusing on
sqlite performance assuming that its characteristics are fixed, to the
point where it drives file system design and benchmarking.  Currently
SQLite does a lot of pointless writes at every single transaction
boundary which could be optimized if you relax the design constraint
that the database has to be in a single file --- something which is a
nice-to-have for some applications, but which really doesn't matter in
an embedded/mobile handset use case.

It may very well be that f2fs is still going to be better since it is
trying to minimize the number of erase blocks that are "open" for
writing at one time.  And even if eMMC devices become more
intelligent, optimizing for erase blocks is still a good thing
(although it may not result in as spectacular wins on flash devices
with more sophisticated FTL's.).

However, it may also be that we'll be able to teach some existing file
system how to be more intelligent about optimizing for erase blocks
that could be made production stable faster.  (I have some ideas of
how to do this for ext4.)

But the point I'm trying to drive home here is that we shouldn't
assume that the only thing we can do is to optimize the file system.
Given the amount of time it takes to test, performance tune, and
confidence that the file system is sound and stable (look at how long
btrfs has taken to mature), it is likely that both flash technology
and workload characteristics will change before f2fs is fully mature
--- and this is no slight on the good work Jaegeuk and his team have
done.

Long experience with file systems show us that they are like fine
wine; they take time to mature.  Whether you're talking about
ext2/3/4, btrfs, Sun's ZFS, Digital's ADVFS, IBM's JFS or GPFS etc.,
and whether you're talking about file systems developed using open
source or more traditional corporate development processes, it takes a
minimum of 3-5 years and 50-200 PY's of effort to create a fully
production-ready file system from scratch (and some of the people
which I surveyed for the Next Generation File System task force, some
of which had decades of experience creating and working with file
systems, thought the 50-75 Person-Year estimate was a lowball --- note
that Sun's ZFS took *seven* years to develop, even with a generously
staffed team.)

As an open source example, the NGFS task force decided to
claim, in its November 2007 report-out, that btrfs would be ready for
community distro's in two years, since otherwise the managers and
other folks who control corporate budgets at the companies involved
would be scared off and decide not to fund the project.  And yet here
we are in 2012, five years later, and we're just starting to see btrfs
support show up in community distro's as a supported option, and I
don't think most people would claim it is ready for production use in
enterprise distro's yet.

Given that, we might as well make sure we can do what we can to
optimize performance up and down the storage stack --- not just at the
file system level, but also by optimizing sqlite for embedded/handset
use cases.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09  7:08                 ` Jaegeuk Kim
  (?)
  (?)
@ 2012-10-10  7:57                 ` Vyacheslav Dubeyko
  2012-10-10  9:43                   ` Jaegeuk Kim
  -1 siblings, 1 reply; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-10  7:57 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

On Tue, 2012-10-09 at 16:08 +0900, Jaegeuk Kim wrote:
> > -----Original Message-----
> > From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > Sent: Tuesday, October 09, 2012 4:23 AM
> > To: Jaegeuk Kim
> > Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> > kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> > linux-fsdevel@vger.kernel.org
> > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system

> > >>> NILFS2 is one of major log-structured file systems, which supports multiple snap-shots.
> > >>> IMO, that feature is quite promising and important to users, but it may degrade the performance.
> > >>> There is a trade-off between functionalities and performance.
> > >>> F2FS chose high performance without any further fancy functionalities.
> > >>>
> > >>
> > >> Performance is a good goal. But fault-tolerance is also very important point. Filesystems are used
> > by
> > >> users, so, it is very important to guarantee reliability of data keeping. Degradation of
> > performance
> > >> by means of snapshots is arguable point. Snapshots can solve the problem not only some
> > unpredictable
> > >> environmental issues but also user's erroneous behavior.
> > >>
> > >
> > > Yes, I agree. I concerned the multiple snapshot feature.
> > > Of course, fault-tolerance is very important, and file system should support it as you know as
> > power-off-recovery.
> > > f2fs supports the recovery mechanism by adopting checkpoint similar to snapshot.
> > > But, f2fs does not support multiple snapshots for user convenience.
> > > I just focused on the performance, and absolutely, the multiple snapshot feature is also a good
> > alternative approach.
> > > That may be a trade-off.
> > 
> > So, maybe I misunderstand something, but I can't understand the difference. As I know, snapshot in
> > NILFS2 is a checkpoint converted by user in snapshot. So, NILFS2's checkpoint is a log that adds new
> > file system's state changing (user data + metadata). In other words, checkpoint is mechanism of
> > writing on volume. Moreover, NILFS2 gives flexible way of checkpoint/snapshot management.
> > 
> > As you are saying, f2fs supports checkpoints also. It means for me that checkpoints are the basic
> > mechanism of writing operations on f2fs. But, about what performance gain and difference do you talk?
> 
> How about the following scenario?
> 1. data "a" is newly written.
> 2. checkpoint "A" is done.
> 3. data "a" is truncated.
> 4. checkpoint "B" is done.
> 
> If fs supports multiple snapshots like "A" and "B" to users, it cannot reuse the space allocated by
> data "a" after checkpoint "B" even though data "a" is safely truncated by checkpoint "B".
> This is because fs should keep data "a" to prepare a roll-back to "A".
> So, even though user sees some free space, LFS may suffer from cleaning due to the exhausted free space.
> If users want to avoid this, they have to remove snapshots by themselves. Or, maybe automatically?
> 

I feel there is some misunderstanding of the checkpoint/snapshot terminology here (especially for the NILFS2 case). A NILFS2 volume can contain only checkpoints (if the user hasn't created any snapshot). You are right that a snapshot cannot be deleted because, in other words, the user has marked that file system state as an important point. But checkpoints can be reclaimed easily. I can't see any problem with reclaiming free space from checkpoints in the above-mentioned scenario in the case of NILFS2. But if a user decides to make a snapshot, then that decision is binding.

So, from my point of view, an f2fs volume contains only checkpoints, without the possibility of freezing any of them as snapshots. The f2fs volume does contain checkpoints, but the user cannot interact with them in any way.

As far as I know, NILFS2 has a garbage collector that removes checkpoints automatically in the background. But it is also possible to force the removal of both checkpoints and snapshots by hand with a special utility. As I understand it, f2fs also has a garbage collector that reclaims the free space of dirty checkpoints. So, what is the difference? My opinion is that the difference lies in the lack of easy checkpoint manipulation in the case of f2fs.

> > 
> > Moreover, user can't manage by f2fs checkpoints completely, as I can understand. It is not so clear
> > what critical points can be a starting points of recovery actions. How is it possible to define how
> > many checkpoints f2fs volume will have?
> 
> IMHO, user does not need to know how many snapshots there exist and track the fs utilization all the time.
> (off list: I don't know why cleaning process should be tuned by users.)
> 

What do you plan to do when users complain about issues with free space reclaiming? If the user doesn't know about checkpoints and has no tools for accessing them, then how is it possible to investigate free-space reclaiming issues on the user's side?

> f2fs writes two checkpoints alternatively. One is for the last stable checkpoint and another is for next checkpoint.
> So, during the recovery, f2fs starts to find one of the latest stable checkpoint.
> The stable checkpoint must have whole index structures and data consistently.
> As you knew, many things can be found in the following LFS paper.
> http://www.cs.berkeley.edu/~brewer/cs262/LFS.pdf
> 
> 
> > 
> > How many user data (metadata) can be lost in the case of sudden power off? Is it possible to estimate
> > this?
> > 
> 
> If user calls sync, f2fs via vfs writes all the data, and it writes a checkpoint.
> In that case, all the data are safe.
> After sync, several fsync can be triggered, and it occurs sudden power off.
> In that case, f2fs first performs roll-back to the last stable checkpoint among two, and then roll-forward to recover fsync'ed data only.
> So, f2fs recovers data triggered by sync or fsync only.
> 

So, as I understand it, f2fs can be recovered by the driver as long as one of the two checkpoints is valid. Sudden power-off can occur at any time. How high is the probability of reaching an f2fs state that the driver cannot recover after a sudden power-off? Is it possible to recover f2fs in such a case with fsck, for example?

> > >
> > >> As I understand, it is not possible to have a perfect performance in all possible workloads. Could
> > you
> > >> point out what workloads are the best way of F2FS using?
> > >
> > > Basically I think the following workloads will be good for F2FS.
> > > - Many random writes : it's LFS nature
> > > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync overhead.
> > >
> > 
> > Yes, it can be so for the case of non-aged f2fs volume. But I am afraid that for the case of aged f2fs
> > volume the situation can be opposite. I think that in the case of aged state of f2fs volume the GC
> > will be under hard work in above-mentioned workloads.
> 
> Yes, you're right.
> In the LFS paper above, there are two logging schemes: threaded logging and copy-and-compaction.
> In order to avoid high cleaning overhead, f2fs adopts a hybrid one which changes the allocation policy dynamically
> between two schemes.
> Threaded logging is similar to the traditional approach, resulting in random writes without cleaning operations.
> Copy-and-compaction is another name of cleaning, resulting in sequential writes with cleaning operations.
> So, f2fs adopts one of them in runtime according to the file system status.
> Through this, we could see the random write performance comparable to ext4 even in the worst case.
> 

As I understand it, the goal of f2fs is to be a flash-friendly file system by reducing unnecessary FTL operations. This goal is achieved by alignment on operating units and by a copy-on-write policy, from my understanding. So, I think that write operations without cleaning can still result in additional FTL operations.

> > 
> > But, as I can understand, smartphones and tablets are the most promising way of f2fs using. Because
> > f2fs designs for NAND flash memory based-storage devices. So, I think that such workloads as "many
> > random writes" or "small writes with frequent fsync" are not so frequent use-cases. Use-case of
> > creation and deletion many small files can be more frequent use-case under smartphones and tablets.
> > But, as I can understand, f2fs has slightly expensive metadata payload in the case of small files
> > creation. Moreover, frequent and random deletion of small files ends in the very sophisticated and
> > unpredictable GC behavior, as I can understand.
> > 
> 
> I'd like to share the following paper.
> http://research.cs.wisc.edu/adsl/Publications/ibench-tocs12.pdf
> 

Excellent paper. Thank you. 

> In our experiments *also* on android phones, we've seen many random patterns with frequent fsync calls.
> We found that the main problem is database, and I think f2fs is beneficial to this.

I think that databases are not the main use-case on Android phones. The dominant use-cases can be operations on multimedia content and operations on small files, from my point of view.

So, it is possible to extract these key points from the shared paper: (1) files have complex structure; (2) sequential access is not really sequential; (3) auxiliary files dominate; (4) multiple threads perform I/O.

I am afraid that random modification of different parts of files, together with I/O operations from multiple threads, can lead to significant fragmentation of both file contents and directory metadata because of garbage collection.

I think that iozone may not be a fully appropriate benchmarking suite for estimating file system performance in such a case. Maybe a special synthetic benchmarking tool is needed.

> As you mentioned, I agree that it is important to handle many small files too.
> It is right that this may cause additional cleaning overhead, and f2fs has some metadata payload overhead.
> In order to reduce the cleaning overhead, f2fs adopts static and dynamic hot and cold data separation.
> The main goal is to split the data according to their type (e.g., dir inode, file inode, dentry data, etc) as much as possible.
> Please see the document in detail.
> I think this approach is quite effective to achieve the goal.
> BTW, the payload overhead can be resolved by adopting embedding data in the inode likewise ext4.
> I think it is also good idea, and I hope to adopt it in future.
> 

As I understand it, f2fs uses an old-fashioned (ext2/ext3-like) block-mapping scheme. This approach has significant metadata and performance overhead. An extent-based approach could be more promising. But I am afraid that an extent approach contradicts f2fs's internal techniques (the garbage collection technique). So, it will be very hard to adopt an extent approach in f2fs, from my point of view.


> > >
> > > As you can see the f2fs kernel document patch, I think one of the most important features is to
> > align operating units between f2fs and ftl.
> > > Specifically, f2fs has section and zone, which are cleaning unit and basic allocation unit
> > respectively.
> > > Through these configurable units in f2fs, I think f2fs is able to reduce the unnecessary operations
> > done by FTL.
> > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges itself some bios
> > likewise ext4.
> > >
> > 
> > As I can understand, it is not so easy to create partition with f2fs volume which is aligned on
> > operating units (especially in the case of eMMC or SSD).
> 
> Could you explain why it is not so easy?
> 
> > Performance of unaligned volume can degrade
> > significantly because of FTL activity. What mechanisms has f2fs for excluding such situation and
> > achieving of the goal to reduce unnecessary FTL operations?
> 
> Could you please explain your concern more exactly?
> In the kernel doc, the start address of f2fs data structure is aligned to the segment size (i.e., 2MB).
> Do you mean that or another operating units (e.g., section and zone)?
> 

I mean that every volume is placed inside some partition (MTD or GPT). A partition can begin at any physical sector. So, as I understand it, an f2fs volume can begin at a physical sector that lies in the middle of a physical erase block. Thereby, in such a case of formatting, f2fs's operating units will be unaligned with respect to physical erase blocks, from my point of view. Maybe I misunderstand something, but this can lead to additional FTL operations and performance degradation, from my point of view.

With the best regards,
Vyacheslav Dubeyko.

> Thanks,
> 
> > 
> > With the best regards,
> > Vyacheslav Dubeyko.
> > 
> > >>
> > >> With the best regards,
> > >> Vyacheslav Dubeyko.
> > >>
> > >>
> > >>>>
> > >>>> Marco
> > >>>
> > >>> ---
> > >>> Jaegeuk Kim
> > >>> Samsung
> > >>>
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > >>> the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >>> Please read the FAQ at  http://www.tux.org/lkml/
> > >
> > >
> > > ---
> > > Jaegeuk Kim
> > > Samsung
> > >
> 
> 
> ---
> Jaegeuk Kim
> Samsung
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 19:53                   ` Jooyoung Hwang
  (?)
@ 2012-10-10  8:05                   ` Vyacheslav Dubeyko
  -1 siblings, 0 replies; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-10  8:05 UTC (permalink / raw)
  To: Jooyoung Hwang
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, linux-fsdevel

On Tue, 2012-10-09 at 14:53 -0500, Jooyoung Hwang wrote:
> On Tue, 2012-10-09 at 16:08 +0900, Jaegeuk Kim wrote:
> > > -----Original Message-----
> > > From: Vyacheslav Dubeyko [mailto:slava@dubeyko.com]
> > > Sent: Tuesday, October 09, 2012 4:23 AM
> > > To: Jaegeuk Kim
> > > Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; tytso@mit.edu; gregkh@linuxfoundation.org; linux-
> > > kernel@vger.kernel.org; chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> > > linux-fsdevel@vger.kernel.org
> > > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system

> 
> I'd like you to refer to the following link as well which is about
> mobile workload pattern.
> http://www.cs.cmu.edu/~fuyaoz/courses/15712/report.pdf
> It's reported that in Android there are frequent issues of fsync and
> most of them are only for small size of data.
> 
> To provide efficient fsync, F2FS minimizes the amount of metadata
> written to serve a fsync. Fsync in F2FS is completed by writing user
> data blocks and direct node blocks which point to them rather than
> creating a new checkpoint which would incur more I/O loads. 
> If sudden power failure happens, then F2FS recovery routine rolls back
> to the latest checkpoint and thereafter recovers file system state to
> reflect all the completed fsync operations, which we call roll-forward
> recovery.
> You may want to look at the code about the roll-forward in recover_fsync_data().
> 

Thank you.

With the best regards,
Vyacheslav Dubeyko.

> --
> Jooyoung Hwang
> Samsung Electronics
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09 19:53                   ` Jooyoung Hwang
  (?)
  (?)
@ 2012-10-10  9:02                   ` Theodore Ts'o
  2012-10-10 11:52                     ` SQLite on flash (was: [PATCH 00/16] f2fs: introduce flash-friendly file system) Clemens Ladisch
  -1 siblings, 1 reply; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-10  9:02 UTC (permalink / raw)
  To: Jooyoung Hwang
  Cc: 'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	gregkh, linux-kernel, chur.lee, cm224.lee, linux-fsdevel

On Tue, Oct 09, 2012 at 02:53:26PM -0500, Jooyoung Hwang wrote:

> I'd like you to refer to the following link as well which is about
> mobile workload pattern.
> http://www.cs.cmu.edu/~fuyaoz/courses/15712/report.pdf
> It's reported that in Android there are frequent issues of fsync and
> most of them are only for small size of data.

What bothers me is no one is asking the question, *why* is Android
(and more specifically SQLite and the applications which call SQLite)
using fsync's so often?  These aren't transaction processing systems,
after all.  So there are two questions that are worth asking here.
(a) Is SQLite being as flash-friendly as possible, and (b) do the
applications really need as many transaction boundaries as they are
requesting of SQLite.

Yes, we can optimize the file system, but sometimes the best way to
optimize a write is not to do the write at all (if it is not
required for the application's functionality, of course).  If the
application is requesting 4 transaction boundaries where only one is
required, we can try to make fsync's more efficient, yes --- but there
is only so much that can be done at the fs layer.
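
As a made-up illustration of what collapsing transaction boundaries
means at the application level: each INSERT issued in autocommit mode
is its own transaction (and hence its own set of fsync's), while
wrapping the whole batch in BEGIN/COMMIT leaves a single boundary.

/* Build with: cc batch.c -lsqlite3.  Table and values are made up. */
#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
        sqlite3 *db;
        char sql[128];
        int i;

        if (sqlite3_open("test.db", &db) != SQLITE_OK)
                return 1;
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS t(k INTEGER, v TEXT)",
                     NULL, NULL, NULL);

        sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
        for (i = 0; i < 100; i++) {
                snprintf(sql, sizeof(sql),
                         "INSERT INTO t VALUES(%d, 'payload')", i);
                sqlite3_exec(db, sql, NULL, NULL, NULL);
        }
        sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);   /* one boundary */

        sqlite3_close(db);
        return 0;
}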

							- Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-10  7:57                 ` [PATCH 00/16] f2fs: introduce flash-friendly file system Vyacheslav Dubeyko
@ 2012-10-10  9:43                   ` Jaegeuk Kim
  2012-10-11  3:14                     ` Namjae Jeon
  2012-10-12 12:30                     ` Vyacheslav Dubeyko
  0 siblings, 2 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-10  9:43 UTC (permalink / raw)
  To: 'Vyacheslav Dubeyko'
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

[snip]
> > How about the following scenario?
> > 1. data "a" is newly written.
> > 2. checkpoint "A" is done.
> > 3. data "a" is truncated.
> > 4. checkpoint "B" is done.
> >
> > If fs supports multiple snapshots like "A" and "B" to users, it cannot reuse the space allocated by
> > data "a" after checkpoint "B" even though data "a" is safely truncated by checkpoint "B".
> > This is because fs should keep data "a" to prepare a roll-back to "A".
> > So, even though user sees some free space, LFS may suffer from cleaning due to the exhausted free
> space.
> > If users want to avoid this, they have to remove snapshots by themselves. Or, maybe automatically?
> >
> 
> I feel that here it exists some misunderstanding in checkpoint/snapshot terminology (especially, for
> the NILFS2 case). It is possible that NILFS2 volume can contain only checkpoints (if user doesn't
> created any snapshot). You are right, snapshot cannot be deleted because, in other word, user marked
> this file system state as important point. But checkpoints can be reclaimed easily. I can't see any
> problem to reclaim free space from checkpoints in above-mentioned scenario in the case of NILFS2. But

I meant that taking a snapshot also performs a checkpoint.
And the problem is related to the real file system utilization managed by NILFS2.

                      [fs utilization to users]  [fs utilization managed by NILFS2]
   (initial state)             X - 1                        X - 1
   1. new data "a"             X                            X
   2. snapshot "A"             X                            X
   3. truncate "a"             X - 1                        X
   4. snapshot "B"             X - 1                        X

After this, the user can see X - 1, but the performance will be determined by X.
Until snapshot "A" is removed, the user will experience the performance determined by X.
Do I misunderstand?

> if a user decides to make a snapshot then it is a law.
> 

I don't believe users can manage all of that perfectly.

> So, from my point of view, f2fs volume contains only checkpoints without possibility freeze some of it
> as snapshot. The f2fs volume contains checkpoints also but user can't touch it in some way.
> 

Right.

> As I know, NILFS2 has Garbage Collector that removes checkpoints automatically in background. But it
> is possible also to force removing as checkpoints as snapshots by hands with special utility using. As

If users do not want snapshots to be removed automatically, do they also have to configure that themselves?

> I can understand, f2fs has Garbage Collector also that reclaims free space of dirty checkpoints. So,
> what is the difference? I have such opinion that difference is in lack of easy manipulation by
> checkpoints in the case of f2fs.

The problem I was concerned about was performance degradation due to the real utilization available to the file system.

> 
> > >
> > > Moreover, user can't manage by f2fs checkpoints completely, as I can understand. It is not so
> clear
> > > what critical points can be a starting points of recovery actions. How is it possible to define
> how
> > > many checkpoints f2fs volume will have?
> >
> > IMHO, user does not need to know how many snapshots there exist and track the fs utilization all the
> time.
> > (off list: I don't know why cleaning process should be tuned by users.)
> >
> 
> What do you plan to do in the case of users' complains about issues with free space reclaiming? If
> user doesn't know about checkpoints and haven't any tools for accessing to checkpoints then how is it
> possible to investigate issues with free space reclaiming on an user side?

Could you explain why reclaiming free space is an issue?
IMHO, that issue is caused by adopting multiple snapshots.

[snip]

> 
> So, as I can understand, f2fs can be recovered by driver in the case of validity of one from two
> checkpoints. Sudden power-off can occur anytime. How high probability to achieve unrecoverable by
> driver state of f2fs during sudden power-off? Is it possible to recover f2fs in such case by fsck, for

In order to avoid that case, f2fs minimizes data writes and carefully overwrites some of them during roll-forward.

> example?
> 
> > > >
> > > >> As I understand, it is not possible to have a perfect performance in all possible workloads.
> Could
> > > you
> > > >> point out what workloads are the best way of F2FS using?
> > > >
> > > > Basically I think the following workloads will be good for F2FS.
> > > > - Many random writes : it's LFS nature
> > > > - Small writes with frequent fsync : f2fs is optimized to reduce the fsync overhead.
> > > >
> > >
> > > Yes, it can be so for the case of non-aged f2fs volume. But I am afraid that for the case of aged
> f2fs
> > > volume the situation can be opposite. I think that in the case of aged state of f2fs volume the GC
> > > will be under hard work in above-mentioned workloads.
> >
> > Yes, you're right.
> > In the LFS paper above, there are two logging schemes: threaded logging and copy-and-compaction.
> > In order to avoid high cleaning overhead, f2fs adopts a hybrid one which changes the allocation
> policy dynamically
> > between two schemes.
> > Threaded logging is similar to the traditional approach, resulting in random writes without cleaning
> operations.
> > Copy-and-compaction is another name of cleaning, resulting in sequential writes with cleaning
> operations.
> > So, f2fs adopts one of them in runtime according to the file system status.
> > Through this, we could see the random write performance comparable to ext4 even in the worst case.
> >
> 
> As I can understand, the goal of f2fs is to be a flash-friendly file system by means of reducing
> unnecessary FTL operations. This goal is achieving by means of alignment on operation unit and copy-
> on-write policy, from my understanding. So, I think that write operations without cleaning can be
> resulted in additional FTL operations.

Yes, but we try to minimize them.

[snip]

> > In our experiments *also* on android phones, we've seen many random patterns with frequent fsync
> calls.
> > We found that the main problem is database, and I think f2fs is beneficial to this.
> 
> I think that database is not main use-case on Android phones. The dominating use-case can be operation
> by multimedia information and operations with small files, from my point of view.
> 
> So, it is possible to extract such key points from the shared paper: (1) file has complex structure;
> (2) sequential access is not sequential; (3) auxiliary files dominate; (4) multiple threads perform
> I/O.
> 
> I am afraid that random modification of different part of files and I/O operations from multiple
> threads can lead to significant fragmentation as file fragments as directory meta-information because
> of garbage collection.

Could you explain in more detail?

> 
> I think that Iozone can be not fully proper benchmarking suite for file system performance estimation
> in such case. Maybe it needs to use special synthetic benchmarking tool.
> 

Yes, it does.

> > As you mentioned, I agree that it is important to handle many small files too.
> > It is right that this may cause additional cleaning overhead, and f2fs has some metadata payload
> overhead.
> > In order to reduce the cleaning overhead, f2fs adopts static and dynamic hot and cold data
> separation.
> > The main goal is to split the data according to their type (e.g., dir inode, file inode, dentry data,
> etc) as much as possible.
> > Please see the document in detail.
> > I think this approach is quite effective to achieve the goal.
> > BTW, the payload overhead can be resolved by adopting embedding data in the inode likewise ext4.
> > I think it is also good idea, and I hope to adopt it in future.
> >
> 
> As I can understand, f2fs uses old-fashioned (ext2/ext3 likewise) block-mapping scheme. This approach
> have significant metadata and performance payload. Extent approach can be more promising approach. But
> I am afraid that extent approach contradicts to f2fs internal techniques (Garbage Collector technique).
> So, it will be very hard to adopt extent approach in f2fs, from my point of view.
> 

Right, so f2fs adopts an extent cache for better read performance.
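
(Just to illustrate the idea -- the structure and names below are made
up and are not the actual f2fs code -- an extent entry caches one
contiguous file-to-disk mapping, so reads within that range can skip
the per-block index lookup:)

#include <stdbool.h>
#include <stdint.h>

struct extent_hint {
        uint32_t file_ofs;      /* first file block covered     */
        uint32_t blk_addr;      /* corresponding on-disk block  */
        uint32_t len;           /* number of contiguous blocks  */
};

/* Fill *blk and return true if the cached extent covers file block 'ofs'. */
bool extent_lookup(const struct extent_hint *ext, uint32_t ofs, uint32_t *blk)
{
        if (ofs >= ext->file_ofs && ofs - ext->file_ofs < ext->len) {
                *blk = ext->blk_addr + (ofs - ext->file_ofs);
                return true;
        }
        return false;
}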

> 
> > > >
> > > > As you can see the f2fs kernel document patch, I think one of the most important features is to
> > > align operating units between f2fs and ftl.
> > > > Specifically, f2fs has section and zone, which are cleaning unit and basic allocation unit
> > > respectively.
> > > > Through these configurable units in f2fs, I think f2fs is able to reduce the unnecessary
> operations
> > > done by FTL.
> > > > And, in order to avoid changing IO patterns by the block-layer, f2fs merges itself some bios
> > > likewise ext4.
> > > >
> > >
> > > As I can understand, it is not so easy to create partition with f2fs volume which is aligned on
> > > operating units (especially in the case of eMMC or SSD).
> >
> > Could you explain why it is not so easy?
> >
> > > Performance of unaligned volume can degrade
> > > significantly because of FTL activity. What mechanisms has f2fs for excluding such situation and
> > > achieving of the goal to reduce unnecessary FTL operations?
> >
> > Could you please explain your concern more exactly?
> > In the kernel doc, the start address of f2fs data structure is aligned to the segment size (i.e.,
> 2MB).
> > Do you mean that or another operating units (e.g., section and zone)?
> >
> 
> I mean that every volume is placed inside some partition (MTD or GPT). A partition can begin at any
> physical sector. So, as I understand it, an f2fs volume can begin at a physical sector that lies
> inside a physical erase block. Thereby, in such a case of formatting, f2fs's operation units will be
> unaligned with respect to physical erase blocks, from my point of view. Maybe I misunderstand
> something, but it can lead to additional FTL operations and performance degradation, from my point of
> view.

I think mkfs already calculates the offset to align that.

Thanks,


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-09  8:31                 ` Lukáš Czerner
  2012-10-09 10:45                     ` Jaegeuk Kim
@ 2012-10-10 10:36                   ` David Woodhouse
  2012-10-12 20:58                     ` Arnd Bergmann
  1 sibling, 1 reply; 154+ messages in thread
From: David Woodhouse @ 2012-10-10 10:36 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Jaegeuk Kim, 'Namjae Jeon', 'Vyacheslav Dubeyko',
	'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1093 bytes --]

On Tue, 2012-10-09 at 10:31 +0200, Lukáš Czerner wrote:
> I am sorry but this reply makes me smile. How can you design a fs
> relying on time attack heuristics to figure out what the proper
> layout should be ? Or even endorse such heuristics to be used in
> mkfs ? What we should be focusing on is to push vendors to actually
> give us such information so we can properly propagate that
> throughout the kernel - that's something everyone will benefit from.
> After that the optimization can be done in every file system.
> 
> Promoting time attack heuristics instead of pushing vendors to tell
> us how their hardware should be used is a journey to hell and we've
> been talking about this for a looong time now. And I imagine that
> you especially have quite some persuasion power.

The whole thing is silly. What we actually want on an embedded system is
to ditch the FTL altogether and have direct access to the NAND. Then we
can *know* our file system is behaving optimally. And we don't need
hacks like TRIM to try to make things a little less broken.

-- 
dwmw2


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: SQLite on flash (was: [PATCH 00/16] f2fs: introduce flash-friendly file system)
  2012-10-10  9:02                   ` Theodore Ts'o
@ 2012-10-10 11:52                     ` Clemens Ladisch
       [not found]                       ` <50756199.1090103-P6GI/4k7KOmELgA04lAiVw@public.gmane.org>
  0 siblings, 1 reply; 154+ messages in thread
From: Clemens Ladisch @ 2012-10-10 11:52 UTC (permalink / raw)
  To: Theodore Ts'o, Jooyoung Hwang, 'Vyacheslav Dubeyko',
	'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	gregkh, linux-kernel, chur.lee, cm224.lee, linux-fsdevel
  Cc: sqlite-users

(CC'd sqlite-users ML)
Theodore Ts'o wrote:
> On Tue, Oct 09, 2012 at 02:53:26PM -0500, Jooyoung Hwang wrote:
>> I'd like you to refer to the following link as well which is about
>> mobile workload pattern.
>> http://www.cs.cmu.edu/~fuyaoz/courses/15712/report.pdf
>> It's reported that in Android there are frequent issues of fsync and
>> most of them are only for small size of data.
>
> What bothers me is no one is asking the question, *why* is Android
> (and more specifically SQLite and the applications which call SQLite)
> using fsync's so often?  These aren't transaction processing systems,
> after all.

Neither were Firefox's bookmarks and history.  That one got fixed,
but it was a single application.

> So there are two questions that are worth asking here.
> (a) Is SQLite being as flash-friendly as possible,

It would be possible to use the write-ahead log instead of the default
rollback journal, but that is unfortunately not entirely compatible --
HTC once enabled WAL by default on some phones, and all apps that tried
to open a database in read-only mode broke.  If apps are aware of this,
they can enable WAL for their own DBs without problems.
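
(For example, an app can opt in per database at open time; a minimal sketch
using the public SQLite C API, where open_in_wal_mode() is just an
illustrative helper name:)

#include <sqlite3.h>
#include <stdio.h>

/* Minimal sketch: open a database and switch it to WAL mode.
 * Only this database is affected; other apps' DBs keep the default
 * rollback journal. */
int open_in_wal_mode(const char *path, sqlite3 **pdb)
{
	int rc = sqlite3_open(path, pdb);

	if (rc != SQLITE_OK)
		return rc;
	rc = sqlite3_exec(*pdb, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);
	if (rc != SQLITE_OK)
		fprintf(stderr, "could not enable WAL: %s\n",
			sqlite3_errmsg(*pdb));
	return rc;
}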

There are some other configuration options, but they, too, have side
effects and thus cannot be enabled by default.

SQLite 4 (currently being developed) will use a log-structured merge
database.

> and (b) do the applications really need as many transaction boundaries
> as they are requesting of SQLite.

Most apps get the default of one transaction per statement because they
do not bother to manage transactions explicitly.
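
(For illustration, batching work inside one explicit transaction turns N
commits -- and their fsyncs -- into one; a sketch with a made-up 'log'
table and error handling omitted:)

#include <sqlite3.h>
#include <stdio.h>

/* Sketch: wrap many statements in one explicit transaction so the journal
 * is committed (and fsync'd) once instead of once per statement. */
static void insert_batch(sqlite3 *db, int n)
{
	char sql[128];
	int i;

	sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
	for (i = 0; i < n; i++) {
		snprintf(sql, sizeof(sql),
			 "INSERT INTO log(msg) VALUES('entry %d');", i);
		sqlite3_exec(db, sql, NULL, NULL, NULL);
	}
	sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);
}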


Regards,
Clemens

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: SQLite on flash (was: [PATCH 00/16] f2fs: introduce flash-friendly file system)
       [not found]                       ` <50756199.1090103-P6GI/4k7KOmELgA04lAiVw@public.gmane.org>
@ 2012-10-10 12:47                         ` Richard Hipp
  2012-10-10 17:17                           ` light weight write barriers Andi Kleen
  0 siblings, 1 reply; 154+ messages in thread
From: Richard Hipp @ 2012-10-10 12:47 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: cm224.lee-Sze3O3UU22JBDgjK7y7TUQ, Theodore Ts'o,
	Marco Stornelli, gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Vyacheslav Dubeyko,
	Al Viro, Jooyoung Hwang, Jaegeuk Kim,
	chur.lee-Sze3O3UU22JBDgjK7y7TUQ

On Wed, Oct 10, 2012 at 7:52 AM, Clemens Ladisch <clemens-P6GI/4k7KOmELgA04lAiVw@public.gmane.org> wrote:

> (CC'd sqlite-users ML)
> Theodore Ts'o wrote:
> >
> > What bothers me is no one is asking the question, *why* is Android
> > (and more specifically SQLite and the applications which call SQLite)
> > using fsync's so often?  These aren't transaction processing systems,
> > after all.
>

SQLite (and every other transactional storage system) needs a write-barrier
operation in order to prevent database corruption on a power-loss or hard
reset.  By "write-barrier" I mean some method of ensuring that all write
operations (on a particular pair of files, collectively) that occur before
the write-barrier must persist to flash prior to any write operations that
occur after the write-barrier.  In other words, no write operations are
allowed to be re-ordered across the write-barrier.  Without a write-barrier
of some kind, it is not possible to ensure the integrity of a transaction
across a power-loss.

The only write-barrier operation available to us on unix is fsync().  In
the default rollback-journal modes of SQLite, a write-barrier is required
for every SQL-level transaction.  This means that with SQLite in rollback
mode, lots of fsync() operations are occurring.
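
(Very roughly, each rollback-journal commit needs an ordering like the one
sketched below; this is a simplification of the idea, not SQLite's actual
pager code, and the commented-out helpers are placeholders:)

#include <unistd.h>

/* Simplified sketch of why a rollback-journal commit needs write-barriers
 * (here: fsync).  'db_fd' and 'journal_fd' are the two inodes involved. */
void commit_rollback_journal(int db_fd, int journal_fd)
{
	/* 1. save the original content of the pages we will modify */
	/* write_journal_pages(journal_fd); */
	fsync(journal_fd);	/* barrier: journal durable before db writes */

	/* 2. now it is safe to overwrite pages in the database file */
	/* write_db_pages(db_fd); */
	fsync(db_fd);		/* barrier: db content before journal delete */

	/* 3. deleting (or zeroing) the journal commits the transaction */
	/* delete_or_zero_journal(journal_fd); */
	fsync(journal_fd);
}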

SQLite also supports a write-ahead log (WAL) mode, which works fine on
Android, and WAL mode only requires a write-barrier on a checkpoint.
Checkpoints (normally) occur far less often than transactions, and hence
far fewer fsyncs() are required.  However, WAL mode requires a
shared-memory segment accessible to all processes.  The shared memory is
used to coordinate access to the WAL file.  On unix, this shared-memory is
obtained using mmap() of a temporary file that is created in the same
directory as the original database.  Unprivileged processes running in
sandboxes without write access to the directory containing the database
file cannot create this temporary file used to implement shared memory, and
thus cannot use WAL mode.  And due to security concerns, a lot of processes
on phones tend to run in unprivileged sandboxes, meaning that they have
difficulty with WAL mode.  There are ways to work around this limitation of
sandboxes, and iOS does make use of those work-arounds.  But Android never
has tried to do so.

The shared-memory temporary files do not have to be in the same directory
as the database.  You can recompile SQLite with the
SQLITE_SHM_DIRECTORY=/dev/shm compile-time option to cause all
shared-memory temporary files to be put in some common place (like
/dev/shm) which is accessible to all processes.  This works fine for many
processes, but fails utterly for processes that try to open SQLite
databases following a chroot().  So it is not the default mode of
operation.  Are there any chroot() processes on Android that use SQLite?
If not, then the SQLITE_SHM_DIRECTORY compile-time option might be a good
idea there.

We would really, really love to have some kind of write-barrier that is
lighter than fsync().  If there is some method other than fsync() for
forcing a write-barrier on Linux that we don't know about, please enlighten
us.

We would also love to have guidance on alternative techniques for obtaining
memory shared across multiple processes that does not involve mmap() of
temporary files.

-- 
D. Richard Hipp
drh-CzDROfG0BjIdnm+yROfE0A@public.gmane.org

^ permalink raw reply	[flat|nested] 154+ messages in thread

* light weight write barriers
  2012-10-10 12:47                         ` Richard Hipp
@ 2012-10-10 17:17                           ` Andi Kleen
       [not found]                             ` <m2fw5mtffg.fsf_-_-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
  2012-10-11 16:32                               ` 杨苏立 Yang Su Li
  0 siblings, 2 replies; 154+ messages in thread
From: Andi Kleen @ 2012-10-10 17:17 UTC (permalink / raw)
  To: linux-kernel, sqlite-users, linux-fsdevel, drh

Richard Hipp writes:
>
> We would really, really love to have some kind of write-barrier that is
> lighter than fsync().  If there is some method other than fsync() for
> forcing a write-barrier on Linux that we don't know about, please enlighten
> us.

Could you list the requirements of such a light weight barrier?
i.e. what would it need to do minimally, what's different from
fsync/fdatasync ?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: light weight write barriers
       [not found]                             ` <m2fw5mtffg.fsf_-_-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
@ 2012-10-10 17:48                               ` Richard Hipp
  2012-10-11 16:38                                   ` Nico Williams
  0 siblings, 1 reply; 154+ messages in thread
From: Richard Hipp @ 2012-10-10 17:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	sqlite-users-CzDROfG0BjIdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, drh-X1OJI8nnyKUAvxtiuMwx3w

On Wed, Oct 10, 2012 at 1:17 PM, Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org> wrote:

> Richard Hipp writes:
> >
> > We would really, really love to have some kind of write-barrier that is
> > lighter than fsync().  If there is some method other than fsync() for
> > forcing a write-barrier on Linux that we don't know about, please
> enlighten
> > us.
>
> Could you list the requirements of such a light weight barrier?
> i.e. what would it need to do minimally, what's different from
> fsync/fdatasync ?
>

For SQLite, the write barrier needs to involve two separate inodes.  The
requirement is this:

After rebooting from a power loss or hard-reset, one or the other of the
following statements must be true of any reader process that examines the
two inodes associated with the write barrier:  (1) it can see the complete
results every write operation (and unlink) that occurred before the write
barrier or (2) it can see no results from any write operation (or unlink)
that occurred after the write barrier.

In the case of SQLite, the write-barrier never needs to involve more than
two inodes:  the original database file and the transaction journal (which
might be either a rollback journal or a write-ahead log, depending on how
SQLite is configured.)  But I would suppose that a general-purpose write
barrier mechanism should involve an arbitrary number of inodes.

Fsync() is a very close approximation to a write barrier since (when it
works as advertised) all pending I/O reaches persistent storage before the
fsync() returns.  And since no subsequent I/Os are issued until after the
fsync() returns, the requirements above are clearly satisfied.  But it really
isn't necessary to actually wait for content to reach persistent storage as
long as we know that content will not reach persistent storage out-of-order.

Note also that when fsync() works as advertised, SQLite transactions are
ACID.  But when fsync() is reduced to a write-barrier, we lose the D
(durable) and transactions are only ACI.  In our experience, nobody really
cares very much about durable across a power-loss.  People are mainly
interested in Atomic, Consistent, and Isolated.  If you take a power loss
and then after reboot you find the 10 seconds of work prior to the power
loss is missing, nobody much cares about that as long as all of the prior
work is still present and consistent.



>
> -Andi
>
> --
> ak-VuQAYsv1563Yd54FQh9/CA@public.gmane.org -- Speaking for myself only
>



-- 
D. Richard Hipp
drh-CzDROfG0BjIdnm+yROfE0A@public.gmane.org

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-10  9:43                   ` Jaegeuk Kim
@ 2012-10-11  3:14                     ` Namjae Jeon
       [not found]                       ` <CAN863PuyMkSZtZCvqX+kwei9v=rnbBYVYr3TqBXF_6uxwJe2_Q@mail.gmail.com>
  2012-10-12 12:30                     ` Vyacheslav Dubeyko
  1 sibling, 1 reply; 154+ messages in thread
From: Namjae Jeon @ 2012-10-11  3:14 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Vyacheslav Dubeyko, Marco Stornelli, Jaegeuk Kim, Al Viro, tytso,
	gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

2012/10/10 Jaegeuk Kim <jaegeuk.kim@samsung.com>:

>>
>> I mean that every volume is placed inside any partition (MTD or GPT). Every partition begins from any
>> physical sector. So, as I can understand, f2fs volume can begin from physical sector that is laid
>> inside physical erase block. Thereby, in such case of formating the f2fs's operation units will be
>> unaligned in relation of physical erase blocks, from my point of view. Maybe, I misunderstand
>> something but it can lead to additional FTL operations and performance degradation, from my point of
>> view.
>
> I think mkfs already calculates the offset to align that.
I think this answer is not what he wants.
If you don't use a partition table such as a DOS partition table or GPT, I
think it is possible to align using mkfs.
But if we have to account for the partition table space in the storage, I don't
understand how it could be aligned using mkfs.

Thanks.
> Thanks,
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]                             ` <m2fw5mtffg.fsf_-_-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
@ 2012-10-11 16:32                               ` 杨苏立 Yang Su Li
  0 siblings, 0 replies; 154+ messages in thread
From: 杨苏立 Yang Su Li @ 2012-10-11 16:32 UTC (permalink / raw)
  To: General Discussion of SQLite Database; +Cc: linux-kernel, linux-fsdevel, drh

I am not quite sure whether I should ask this question here, but in terms
of light weight barriers/fsync, could anyone tell me why the device
driver / OS provides the barrier interface rather than some other
abstraction? I am sorry if this sounds like a stupid question
or if it has been discussed before....

I mean, most of the time, we only need some ordering of writes; not
a complete order, but a partial, very simple topological order. And a
barrier seems to be a heavyweight solution to achieve this anyway:
you have to finish all writes before the barrier, then start all
writes issued after the barrier. That is an ordering which is much
stronger than what we need, isn't it?

As most of the time the order we need does not involve too many blocks
(certainly a lot fewer than all the cached blocks in the system or in
the disk's cache), that topological order isn't likely to be very
complicated, and I imagine it could be implemented efficiently in a
modern device, which already has complicated caching/garbage
collection/whatever going on internally. In particular, it seems not
too hard to implement on top of SCSI's ordered/simple task modes?
(I believe Windows does this to an extent, but I am not quite sure.)

Thanks a lot

Suli


On Wed, Oct 10, 2012 at 12:17 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Richard Hipp writes:
>>
>> We would really, really love to have some kind of write-barrier that is
>> lighter than fsync().  If there is some method other than fsync() for
>> forcing a write-barrier on Linux that we don't know about, please enlighten
>> us.
>
> Could you list the requirements of such a light weight barrier?
> i.e. what would it need to do minimally, what's different from
> fsync/fdatasync ?
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-11 16:38                                   ` Nico Williams
  0 siblings, 0 replies; 154+ messages in thread
From: Nico Williams @ 2012-10-11 16:38 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Andi Kleen, linux-fsdevel, linux-kernel, drh

On Wed, Oct 10, 2012 at 12:48 PM, Richard Hipp <drh@sqlite.org> wrote:
>> Could you list the requirements of such a light weight barrier?
>> i.e. what would it need to do minimally, what's different from
>> fsync/fdatasync ?
>
> For SQLite, the write barrier needs to involve two separate inodes.  The
> requirement is this:

...

> Note also that when fsync() works as advertised, SQLite transactions are
> ACID.  But when fsync() is reduced to a write-barrier, we loss the D
> (durable) and transactions are only ACI.  In our experience, nobody really
> cares very much about durable across a power-loss.  People are mainly
> interested in Atomic, Consistent, and Isolated.  If you take a power loss
> and then after reboot you find the 10 seconds of work prior to the power
> loss is missing, nobody much cares about that as long as all of the prior
> work is still present and consistent.

There is something you can do: use a combination of COW on-disk
formats in such a way that it's possible to detect partially-committed
transactions and roll back to the last known good root, and
backgrounded fsync()s (i.e., in a separate thread, without waiting for
the fsync() to complete).

Nico
--

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-11 16:48                                     ` Nico Williams
  0 siblings, 0 replies; 154+ messages in thread
From: Nico Williams @ 2012-10-11 16:48 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Andi Kleen, linux-fsdevel, linux-kernel, drh

To expand a bit, the on-disk format needs to allow the roots of N of
the last transactions to be/remain reachable at all times.  At open
time you look for the latest transaction, verify that it has been
written[0] completely, then use it, else look for the preceding
transaction, verify it, and so on.

N needs to be at least 2: the last and the preceding transactions.  No
blocks should be freed or reused for any transactions still in use or
possible use (e.g., for power failure recovery).  For high read
concurrency you can allow connections to lock a past transaction so
that no blocks are freed that are needed to access the DB at that
state.

This all goes back to 1980s DB and filesystem concepts.  See, for
example, the BSD4.4 Log Structure Filesystem.  (I mention this in case
there are concerns about patents, though IANAL and I make no
particular assertions here other than that there is plenty of old
prior art and expired patents that can probably be used to obtain
sufficient certainty as to the patent law risks in the approach
described herein.)

[0] E.g., check a transaction block manifest and check that those
blocks were written correctly; or traverse the tree looking for
differences to the previous transaction; this may require checking
block contents checksums.
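
(In code, that open-time scan might look roughly like this; everything
below -- the types, the helper names, and the verification step -- is
hypothetical:)

#include <stddef.h>

/* Illustrative only: walk back from the newest transaction root until one
 * verifies completely, then mount from that root. */
struct superblock;
struct txn_root;

struct txn_root *find_newest_root(struct superblock *sb);
struct txn_root *find_previous_root(struct superblock *sb, struct txn_root *r);
int verify_root(struct txn_root *r);	/* e.g. manifest + block checksums */

struct txn_root *recover_latest_valid_root(struct superblock *sb)
{
	struct txn_root *root = find_newest_root(sb);

	while (root) {
		if (verify_root(root))
			return root;	/* fully written: use it */
		root = find_previous_root(sb, root);	/* else step back */
	}
	return NULL;	/* nothing recoverable */
}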

Nico
--

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-11 16:32                               ` 杨苏立 Yang Su Li
  (?)
@ 2012-10-11 17:41                               ` Christoph Hellwig
  -1 siblings, 0 replies; 154+ messages in thread
From: Christoph Hellwig @ 2012-10-11 17:41 UTC (permalink / raw)
  To: 杨苏立 Yang Su Li
  Cc: General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

On Thu, Oct 11, 2012 at 11:32:27AM -0500, 杨苏立 Yang Su Li wrote:
> I am not quite whether I should ask this question here, but in terms
> of light weight barrier/fsync, could anyone tell me why the device
> driver / OS provide the barrier interface other than some other
> abstractions anyway? I am sorry if this sounds like a stupid questions
> or it has been discussed before....

It does not.  Except for the legacy mount option naming there is no such
thing as a barrier in Linux these days.


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-10  9:43                   ` Jaegeuk Kim
  2012-10-11  3:14                     ` Namjae Jeon
@ 2012-10-12 12:30                     ` Vyacheslav Dubeyko
  2012-10-12 14:25                       ` Jaegeuk Kim
  1 sibling, 1 reply; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-12 12:30 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

On Wed, 2012-10-10 at 18:43 +0900, Jaegeuk Kim wrote:
> [snip]
> > > How about the following scenario?
> > > 1. data "a" is newly written.
> > > 2. checkpoint "A" is done.
> > > 3. data "a" is truncated.
> > > 4. checkpoint "B" is done.
> > >
> > > If fs supports multiple snapshots like "A" and "B" to users, it cannot reuse the space allocated by
> > > data "a" after checkpoint "B" even though data "a" is safely truncated by checkpoint "B".
> > > This is because fs should keep data "a" to prepare a roll-back to "A".
> > > So, even though user sees some free space, LFS may suffer from cleaning due to the exhausted free
> > space.
> > > If users want to avoid this, they have to remove snapshots by themselves. Or, maybe automatically?
> > >
> > 
> > I feel that here it exists some misunderstanding in checkpoint/snapshot terminology (especially, for
> > the NILFS2 case). It is possible that NILFS2 volume can contain only checkpoints (if user doesn't
> > created any snapshot). You are right, snapshot cannot be deleted because, in other word, user marked
> > this file system state as important point. But checkpoints can be reclaimed easily. I can't see any
> > problem to reclaim free space from checkpoints in above-mentioned scenario in the case of NILFS2. But
> 
> I meant that snapshot does checkpoint.
> And, the problem is related to real file system utilization managed by NILFS2.
>                      [fs utilization to users]   [fs utilization managed by NILFS2]
>                                 X - 1                       X - 1
> 1. new data "a"            X                            X
> 2. snapshot "A"            X                            X
> 3. truncate "a"            X - 1                       X
> 4. snapshot "B"            X - 1                       X
> 
> After this, user can see X-1, but the performance will be affected by X.
> Until the snapshot "A" is removed, user will experience the performance determined by X.
> Do I misunderstand?
> 

Ok. Maybe I have some misunderstanding but checkpoint and snapshot are different things for me (especially, in the case of NILFS2). :-)

The most important thing is that f2fs has a more efficient scheme for working with checkpoints, from your point of view. If you are right, then it is very good. And I need to become more familiar with the f2fs code.

[snip]
> > As I know, NILFS2 has Garbage Collector that removes checkpoints automatically in background. But it
> > is possible also to force removing as checkpoints as snapshots by hands with special utility using. As
> 
> If users may not want to remove the snapshots automatically, should they configure not to do this too?
> 

As far as I know, NILFS2 doesn't delete snapshots automatically, but checkpoints - yes. Moreover, there is a nilfs_cleanerd.conf configuration file that makes it possible to manage the NILFS cleanerd daemon's behavior (min/max number of clean segments, selection policy, check/clean intervals and so on).

[snip]
> > > IMHO, user does not need to know how many snapshots there exist and track the fs utilization all the
> > time.
> > > (off list: I don't know why cleaning process should be tuned by users.)
> > >
> > 
> > What do you plan to do in the case of users' complains about issues with free space reclaiming? If
> > user doesn't know about checkpoints and haven't any tools for accessing to checkpoints then how is it
> > possible to investigate issues with free space reclaiming on an user side?
> 
> Could you explain why reclaiming free space is an issue?
> IMHO, that issue is caused by adopting multiple snapshots.
> 

I didn't mean that reclaiming free space is an issue. I hope that f2fs is stable, but unfortunately it is not possible for any software to be completely without bugs. So, anyway, f2fs users can run into some issues during use. One possible issue could be an unexpected situation where free space is not reclaimed. So, my question was about the possibility of investigating such a bug on the user's side. From my point of view, NILFS2 has very good utilities for such an investigation.

[snip]
> > > In our experiments *also* on android phones, we've seen many random patterns with frequent fsync
> > calls.
> > > We found that the main problem is database, and I think f2fs is beneficial to this.
> > 
> > I think that database is not main use-case on Android phones. The dominating use-case can be operation
> > by multimedia information and operations with small files, from my point of view.
> > 
> > So, it is possible to extract such key points from the shared paper: (1) file has complex structure;
> > (2) sequential access is not sequential; (3) auxiliary files dominate; (4) multiple threads perform
> > I/O.
> > 
> > I am afraid that random modification of different part of files and I/O operations from multiple
> > threads can lead to significant fragmentation as file fragments as directory meta-information because
> > of garbage collection.
> 
> Could you explain in more detail?
> 

I mean that the complex structure of modern files can lead to random modification of small parts of files. Moreover, such modifications can occur from multiple threads. So, to me this means that a Copy-On-Write policy can lead to fragmentation of file content. Then GC can add further fragmentation as well.

But maybe I have some misunderstanding of f2fs's internal techniques.

With the best regards,
Vyacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-12 12:30                     ` Vyacheslav Dubeyko
@ 2012-10-12 14:25                       ` Jaegeuk Kim
  0 siblings, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-12 14:25 UTC (permalink / raw)
  To: Vyacheslav Dubeyko
  Cc: Jaegeuk Kim, 'Marco Stornelli', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

2012-10-12 (Fri), 16:30 +0400, Vyacheslav Dubeyko:
> On Wed, 2012-10-10 at 18:43 +0900, Jaegeuk Kim wrote:
> > [snip]
> > > > How about the following scenario?
> > > > 1. data "a" is newly written.
> > > > 2. checkpoint "A" is done.
> > > > 3. data "a" is truncated.
> > > > 4. checkpoint "B" is done.
> > > >
> > > > If fs supports multiple snapshots like "A" and "B" to users, it cannot reuse the space allocated by
> > > > data "a" after checkpoint "B" even though data "a" is safely truncated by checkpoint "B".
> > > > This is because fs should keep data "a" to prepare a roll-back to "A".
> > > > So, even though user sees some free space, LFS may suffer from cleaning due to the exhausted free
> > > space.
> > > > If users want to avoid this, they have to remove snapshots by themselves. Or, maybe automatically?
> > > >
> > > 
> > > I feel that here it exists some misunderstanding in checkpoint/snapshot terminology (especially, for
> > > the NILFS2 case). It is possible that NILFS2 volume can contain only checkpoints (if user doesn't
> > > created any snapshot). You are right, snapshot cannot be deleted because, in other word, user marked
> > > this file system state as important point. But checkpoints can be reclaimed easily. I can't see any
> > > problem to reclaim free space from checkpoints in above-mentioned scenario in the case of NILFS2. But
> > 
> > I meant that snapshot does checkpoint.
> > And, the problem is related to real file system utilization managed by NILFS2.
> >                      [fs utilization to users]   [fs utilization managed by NILFS2]
> >                                 X - 1                       X - 1
> > 1. new data "a"            X                            X
> > 2. snapshot "A"            X                            X
> > 3. truncate "a"            X - 1                       X
> > 4. snapshot "B"            X - 1                       X
> > 
> > After this, user can see X-1, but the performance will be affected by X.
> > Until the snapshot "A" is removed, user will experience the performance determined by X.
> > Do I misunderstand?
> > 
> 
> Ok. Maybe I have some misunderstanding but checkpoint and snapshot are different things for me (especially, in the case of NILFS2). :-)
> 
> The most important is that f2fs has more efficient scheme of working with checkpoints, from your point of view. If you are right then it is very good. And I need to be more familiar with f2fs code.
> 

Ok, thanks.

> [snip]
> > > As I know, NILFS2 has Garbage Collector that removes checkpoints automatically in background. But it
> > > is possible also to force removing as checkpoints as snapshots by hands with special utility using. As
> > 
> > If users may not want to remove the snapshots automatically, should they configure not to do this too?
> > 
> 
> As I know, NILFS2 doesn't delete snapshots automatically but checkpoints - yes. Moreover, it exists nilfs_cleanerd.conf configuration file that makes possible to manage by NILFS cleanerd daemon's behavior (min/max number of clean segments, selection policy, check/clean intervals and so on).
> 

Ok.

> [snip]
> > > > IMHO, user does not need to know how many snapshots there exist and track the fs utilization all the
> > > time.
> > > > (off list: I don't know why cleaning process should be tuned by users.)
> > > >
> > > 
> > > What do you plan to do in the case of users' complains about issues with free space reclaiming? If
> > > user doesn't know about checkpoints and haven't any tools for accessing to checkpoints then how is it
> > > possible to investigate issues with free space reclaiming on an user side?
> > 
> > Could you explain why reclaiming free space is an issue?
> > IMHO, that issue is caused by adopting multiple snapshots.
> > 
> 
> I didn't mean that reclaiming free space is an issue. I hope that f2fs
> is stable but unfortunately it is not possible for any software to be
> completely without bugs. So, anyway, f2fs users can have some issues
> during using. One of the possible issue can be unexpected situation
> with not reclaiming of free space. So, my question was about
> possibility to investigate such bug on the user's side. From my point
> of view, NILFS2 has very good utilities for such investigation.

You mean fsck?
Of course, we've implemented an fsck tool as well.
But the reason I haven't released it yet is that the code is a mess.
Another reason is that the current fsck tool only checks
the consistency of f2fs.
We're still working on it before releasing it.

> 
> [snip]
> > > > In our experiments *also* on android phones, we've seen many random patterns with frequent fsync
> > > calls.
> > > > We found that the main problem is database, and I think f2fs is beneficial to this.
> > > 
> > > I think that database is not main use-case on Android phones. The dominating use-case can be operation
> > > by multimedia information and operations with small files, from my point of view.
> > > 
> > > So, it is possible to extract such key points from the shared paper: (1) file has complex structure;
> > > (2) sequential access is not sequential; (3) auxiliary files dominate; (4) multiple threads perform
> > > I/O.
> > > 
> > > I am afraid that random modification of different part of files and I/O operations from multiple
> > > threads can lead to significant fragmentation as file fragments as directory meta-information because
> > > of garbage collection.
> > 
> > Could you explain in more detail?
> > 
> 
> I mean that complex structure of modern files can lead to random modification of small file's parts.
> Moreover, such modifications can occur from multiple threads.
> So, it means for me that Copy-On-Write policy can lead to file's content fragmentation.
> Then GC can make additional fragmentation also.
> But maybe I have some misunderstanding of f2fs internal techniques.
> 

Right. Random modification may cause data fragmentation due to COW in LFS.
But this is from the host-side view only.
If we consider the FTL underneath a file system adopting the in-place-update scheme,
eventually the FTL has to handle the fragmentation issue instead of
the file system.
So, I think fragmentation is not an issue particular to LFS.

> With the best regards,
> Vyacheslav Dubeyko.
> 
> 

-- 
Jaegeuk Kim
Samsung


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-10  4:53                         ` Theodore Ts'o
  (?)
@ 2012-10-12 20:55                         ` Arnd Bergmann
  -1 siblings, 0 replies; 154+ messages in thread
From: Arnd Bergmann @ 2012-10-12 20:55 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Lukáš Czerner, Jaegeuk Kim, 'Namjae Jeon',
	'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

On Wednesday 10 October 2012 00:53:51 Theodore Ts'o wrote:
> On Tue, Oct 09, 2012 at 01:01:24PM +0200, Lukáš Czerner wrote:
> > Do not get me wrong, I do not think it is worth to wait for vendors
> > to come to their senses, but it is worth constantly reminding that
> > we *need* this kind of information and those heuristics are not
> > feasible in the long run anyway.
> 
> A number of us has been telling flash vendors exactly this.  The
> technical people do seem to understand.  It's management who seem to
> be primarily clueless, even though this information can be extracted
> by employing timing attacks on the media.  I've pointed this out
> before, and the technical people agree that trying to keep this
> information as a "trade secret" is pointless, stupid, and
> counterproductive.  Trying to get the pointy-haired bosses to
> understand may take quite a while.

For eMMC, I think we should start out defaulting to the characteristics
that are reported by the device, because they are usually correct
and those vendors for which that is not true can hopefully
come to their senses when they see how f2fs performs by default.

For USB media, the protocol does not allow you to specify the
erase block size, so we have to guess.

For SD cards, there is a field in the card's registers, but I've
never seen any value in there other than 4 MB, and in most cases
where that is not true, the standard does not allow encoding
the correct amount: it only allows power-of-two numbers up to
4 MB, and typical numbers these days are 3 MB, 6 MB or 8 MB.
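
(FWIW, whatever value the kernel derived from the card registers is exposed
in sysfs as preferred_erase_size, so it is at least easy to inspect; a quick
sketch, assuming the card shows up as mmcblk0:)

#include <stdio.h>

/* Sketch: print the erase/allocation granularity the kernel derived from
 * the card's registers. */
int main(void)
{
	FILE *f = fopen("/sys/block/mmcblk0/device/preferred_erase_size", "r");
	unsigned long long size;

	if (!f) {
		perror("preferred_erase_size");
		return 1;
	}
	if (fscanf(f, "%llu", &size) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("preferred erase size: %llu bytes\n", size);
	return 0;
}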

	Arnd

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-10 10:36                   ` David Woodhouse
@ 2012-10-12 20:58                     ` Arnd Bergmann
  2012-10-13  4:26                       ` Namjae Jeon
  0 siblings, 1 reply; 154+ messages in thread
From: Arnd Bergmann @ 2012-10-12 20:58 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Lukáš Czerner, Jaegeuk Kim, 'Namjae Jeon',
	'Vyacheslav Dubeyko', 'Marco Stornelli',
	'Jaegeuk Kim', 'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

On Wednesday 10 October 2012 11:36:14 David Woodhouse wrote:
> The whole thing is silly. What we actually want on an embedded system is
> to ditch the FTL altogether and have direct access to the NAND. Then we
> can know our file system is behaving optimally. And we don't need
> hacks like TRIM to try to make things a little less broken.

I think it's safe to say that the times for raw flash in consumer devices
are over, whether we like it or not. Even if we could go back to MTD
for internal storage, we'd still need something better than what we
have for removable flash storage such as USB and SD.

(and I know that xD cards are basically raw flash, but have you tried
to buy one recently?)

	Arnd

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-12 20:58                     ` Arnd Bergmann
@ 2012-10-13  4:26                       ` Namjae Jeon
  2012-10-13 12:37                           ` Jaegeuk Kim
  0 siblings, 1 reply; 154+ messages in thread
From: Namjae Jeon @ 2012-10-13  4:26 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Arnd Bergmann, David Woodhouse, Lukáš Czerner,
	Vyacheslav Dubeyko, Marco Stornelli, Jaegeuk Kim, Al Viro, tytso,
	gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

Is there a high possibility that the storage device can be rapidly
worn out by the cleaning process? e.g. a severe fragmentation situation caused by
creating and removing small files.

And you told us only the advantages of f2fs. Would you tell us the disadvantages?

Thanks.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-13  4:26                       ` Namjae Jeon
@ 2012-10-13 12:37                           ` Jaegeuk Kim
  0 siblings, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-13 12:37 UTC (permalink / raw)
  To: Namjae Jeon
  Cc: Jaegeuk Kim, Arnd Bergmann, David Woodhouse, Luk Czerner,
	Vyacheslav Dubeyko, Marco Stornelli, Al Viro, tytso, gregkh,
	linux-kernel, chur.lee, cm224.lee, jooyoung.hwang, linux-fsdevel

2012-10-13 (Sat), 13:26 +0900, Namjae Jeon:
> Is there high possibility that the storage device can be rapidly
> worn-out by cleaning process ? e.g. severe fragmentation situation by
> creating and removing small files.
> 

Yes, the cleaning process in F2FS induces additional writes, so the
flash storage can be worn out more quickly.
However, how about in traditional file systems?
As all of us know, the FTL has a wear-leveling issue too, due to the
garbage collection overhead that is fundamentally similar to the
cleaning overhead in LFS or F2FS.

So, what's the difference between them?
IMHO, the major factor in reducing the cleaning or garbage collection
overhead is how to efficiently separate hot and cold data.
So, which is the better layer to achieve that, the FTL or the file system?
I think the answer is the file system, since the file system has much
more information on the hotness of all the data, while the FTL doesn't know,
or finds it hard to figure out, that kind of information.

Therefore, I think the LFS approach is more beneficial for extending the
lifetime of the storage than the traditional one.
And, in order to do this perfectly, one criterion matters: the
alignment between the FTL and F2FS.
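
(As a rough illustration of what static separation by data type means --
not the actual f2fs policy, just the idea of routing different kinds of
blocks to different active logs:)

/* Illustrative only: pick one of several active logs by block type, so
 * blocks with similar lifetimes land in the same segments and cleaning
 * has to migrate less live data. */
enum log_type { HOT_NODE, WARM_NODE, COLD_NODE,
		HOT_DATA, WARM_DATA, COLD_DATA };

enum blk_kind { DIR_NODE, FILE_NODE, INDIRECT_NODE,
		DENTRY_BLOCK, USER_DATA, MOVED_BY_GC };

static enum log_type pick_log(enum blk_kind kind)
{
	switch (kind) {
	case DIR_NODE:      return HOT_NODE;	/* updated very often */
	case FILE_NODE:     return WARM_NODE;
	case INDIRECT_NODE: return COLD_NODE;
	case DENTRY_BLOCK:  return HOT_DATA;
	case USER_DATA:     return WARM_DATA;
	case MOVED_BY_GC:   return COLD_DATA;	/* survived cleaning: cold */
	}
	return WARM_DATA;
}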

> And you told us only advantages of f2fs. Would you tell us the disadvantages ?

I think there is a scenario like this.
1) One big file is created and written data sequentially.
2) Many random writes are done across the whole file range.
3) User discards cached data by doing "drop_caches" or "reboot".

At this point, I worry about the sequential read performance due to the
fragmentation.
I don't know how frequently this use-case happens, but it is one of the cons
of the LFS approach.
Nevertheless, I'm thinking that the performance could be enhanced by
cooperating with the readahead mechanism in the VFS.

Thanks,

> 
> Thanks.

-- 
Jaegeuk Kim
Samsung


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-13 12:37                           ` Jaegeuk Kim
  (?)
@ 2012-10-17 11:12                           ` Namjae Jeon
       [not found]                             ` <000001cdacef$b2f6eaa0$18e4bfe0$%kim@samsung.com>
  -1 siblings, 1 reply; 154+ messages in thread
From: Namjae Jeon @ 2012-10-17 11:12 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Jaegeuk Kim, Arnd Bergmann, David Woodhouse, Luk Czerner,
	Vyacheslav Dubeyko, Marco Stornelli, Al Viro, tytso, gregkh,
	linux-kernel, chur.lee, cm224.lee, jooyoung.hwang, linux-fsdevel

2012/10/13, Jaegeuk Kim <jaegeuk.kim@gmail.com>:
> 2012-10-13 (Sat), 13:26 +0900, Namjae Jeon:
>> Is there high possibility that the storage device can be rapidly
>> worn-out by cleaning process ? e.g. severe fragmentation situation by
>> creating and removing small files.
>>
>
> Yes, the cleaning process in F2FS induces additional writes so that
> flash storage can be worn out quickly.
> However, how about in traditonal file systems?
> As all of us know that, FTL has an wear-leveling issue too due to the
> garbage collection overhead that is fundamentally similar to the
> cleaning overhead in LFS or F2FS.
>
> So, what's the difference between them?
> IMHO, the major factor to reduce the cleaning or garbage collection
> overhead is how to efficiently separate hot and cold data.
> So, which is a better layer between FTL and file system to achieve that?
> I think the answer is the file system, since the file system has much
> more information on such a hotness of all the data, but FTL doesn't know
> or is hard to figure out that kind of information.
>
> Therefore, I think the LFS approach is more beneficial to span the life
> time of the storage rather than traditional one.
> And, in order to do this perfectly, one thing is a criteria, the
> alignment between FTL and F2FS.

As you know, normally users don't use one big partition on eMMC.
It means they divide it into several small partitions.
And f2fs will work on each small partition.
And the eMMC's FTL works globally across the whole device.
I cannot imagine how the cleaning process of f2fs and the FTL of the
eMMC could work synchronously.

And would you share the slides or documentation of f2fs once the Korea Linux Forum is finished?

Thanks.
>
>> And you told us only advantages of f2fs. Would you tell us the
>> disadvantages ?
>
> I think there is a scenario like this.
> 1) One big file is created and written data sequentially.
> 2) Many random writes are done across the whole file range.
> 3) User discards cached data by doing "drop_caches" or "reboot".
>
> At this point, I worry about the sequential read performance due to the
> fragmentation.
> I don't know how frequently this use-case happens, but it is one of cons
> in the LFS approach.
> Nevertheless, I'm thinking that the performance could be enhanced by
> cooperating with a readahead mechanism in VFS.
>
> Thanks,
>
>>
>> Thanks.
>
> --
> Jaegeuk Kim
> Samsung
>
>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
       [not found]                       ` <CAN863PuyMkSZtZCvqX+kwei9v=rnbBYVYr3TqBXF_6uxwJe2_Q@mail.gmail.com>
@ 2012-10-17 11:13                         ` Namjae Jeon
  2012-10-17 23:06                           ` Changman Lee
  0 siblings, 1 reply; 154+ messages in thread
From: Namjae Jeon @ 2012-10-17 11:13 UTC (permalink / raw)
  To: Changman Lee
  Cc: Jaegeuk Kim, Vyacheslav Dubeyko, Marco Stornelli, Jaegeuk Kim,
	Al Viro, tytso, gregkh, linux-kernel, chur.lee, cm224.lee,
	jooyoung.hwang, linux-fsdevel

2012/10/11, Changman Lee <cm224.lee@gmail.com>:
> On Thursday, October 11, 2012, Namjae Jeon <linkinjeon@gmail.com> wrote:
>> 2012/10/10 Jaegeuk Kim <jaegeuk.kim@samsung.com>:
>>
>>>>
>>>> I mean that every volume is placed inside any partition (MTD or GPT).
> Every partition begins from any
>>>> physical sector. So, as I can understand, f2fs volume can begin from
> physical sector that is laid
>>>> inside physical erase block. Thereby, in such case of formating the
> f2fs's operation units will be
>>>> unaligned in relation of physical erase blocks, from my point of view.
> Maybe, I misunderstand
>>>> something but it can lead to additional FTL operations and performance
> degradation, from my point of
>>>> view.
>>>
>>> I think mkfs already calculates the offset to align that.
>> I think this answer is not what he want.
>> If you don't use partition table such as dos partition table or gpt, I
>> think that it is possible to align using mkfs.
>> But If we should consider partition table space in storage, I don't
>> understand how it  could be align using mkfs.
>>
>> Thanks.
>
> We can know the physical starting sector address of any partitions from
> hdio geometry information got by ioctl.
If so, are the first block and end block of the partition useless?

Thanks.
>
>>> Thanks,
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
> in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>> in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-17 11:13                         ` Namjae Jeon
@ 2012-10-17 23:06                           ` Changman Lee
  0 siblings, 0 replies; 154+ messages in thread
From: Changman Lee @ 2012-10-17 23:06 UTC (permalink / raw)
  To: 'Namjae Jeon', 'Changman Lee'
  Cc: 'Jaegeuk Kim', 'Vyacheslav Dubeyko',
	'Marco Stornelli', 'Jaegeuk Kim',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, jooyoung.hwang,
	linux-fsdevel



> -----Original Message-----
> From: Namjae Jeon [mailto:linkinjeon@gmail.com]
> Sent: Wednesday, October 17, 2012 8:14 PM
> To: Changman Lee
> Cc: Jaegeuk Kim; Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro;
> tytso@mit.edu; gregkh@linuxfoundation.org; linux-kernel@vger.kernel.org;
> chur.lee@samsung.com; cm224.lee@samsung.com; jooyoung.hwang@samsung.com;
> linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> 2012/10/11, Changman Lee <cm224.lee@gmail.com>:
> > On Thursday, October 11, 2012, Namjae Jeon <linkinjeon@gmail.com> wrote:
> >> 2012/10/10 Jaegeuk Kim <jaegeuk.kim@samsung.com>:
> >>
> >>>>
> >>>> I mean that every volume is placed inside any partition (MTD or GPT).
> > Every partition begins from any
> >>>> physical sector. So, as I can understand, f2fs volume can begin from
> > physical sector that is laid
> >>>> inside physical erase block. Thereby, in such case of formating the
> > f2fs's operation units will be
> >>>> unaligned in relation of physical erase blocks, from my point of view.
> > Maybe, I misunderstand
> >>>> something but it can lead to additional FTL operations and performance
> > degradation, from my point of
> >>>> view.
> >>>
> >>> I think mkfs already calculates the offset to align that.
> >> I think this answer is not what he want.
> >> If you don't use partition table such as dos partition table or gpt, I
> >> think that it is possible to align using mkfs.
> >> But If we should consider partition table space in storage, I don't
> >> understand how it  could be align using mkfs.
> >>
> >> Thanks.
> >
> > We can know the physical starting sector address of any partitions from
> > hdio geometry information got by ioctl.
> If so, first block and end block of partition are useless ?
> 
> Thanks.

For example:
If we try to align the start point of F2FS to 2MB but the start sector of the partition is not 2MB-aligned,
then F2FS will simply have some unused blocks at the front. In exchange, F2FS could reduce the GC cost of the FTL.
I don't know whether this answer is what you want.
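
(As a rough illustration of the point above -- this is not code from
mkfs.f2fs, and the 2MB zone size and device path are only example
assumptions -- the partition's starting sector can be read with the
HDIO_GETGEO ioctl and used to pad the first usable block to a zone
boundary:)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

/* Sketch: report how many sectors of padding would 2MB-align a partition. */
int main(int argc, char **argv)
{
    struct hd_geometry geo;
    int fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);          /* e.g. /dev/sdb1 */
    if (fd < 0 || ioctl(fd, HDIO_GETGEO, &geo) < 0) {
        perror("HDIO_GETGEO");
        return 1;
    }

    /* geo.start is the partition's first 512-byte sector on the whole
     * device; choose an offset so the filesystem's first zone starts
     * on a 2MB boundary of the underlying device. */
    unsigned long zone = (2UL * 1024 * 1024) / 512;
    unsigned long pad  = (zone - geo.start % zone) % zone;

    printf("partition starts at sector %lu, pad %lu sectors for 2MB alignment\n",
           geo.start, pad);
    close(fd);
    return 0;
}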

> >
> >>> Thanks,
> >>>


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
       [not found]                             ` <000001cdacef$b2f6eaa0$18e4bfe0$%kim@samsung.com>
@ 2012-10-18 13:39                               ` Vyacheslav Dubeyko
  2012-10-18 22:14                                 ` Jaegeuk Kim
  2012-10-19  9:20                                 ` NeilBrown
  0 siblings, 2 replies; 154+ messages in thread
From: Vyacheslav Dubeyko @ 2012-10-18 13:39 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: 'Namjae Jeon', 'Jaegeuk Kim',
	'Arnd Bergmann', 'David Woodhouse',
	'Luk Czerner', 'Marco Stornelli',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

[snip]
> > 
> > And Would you share ppt or document of f2fs if Korea Linux Forum is finished ?
> > 
> 
> Here I attached the slides, and LF will also share the slides.
> Thanks,
> 

I had hoped the slides would contain a more detailed description. Maybe they
are good enough for the Linux Forum. But do you plan to publish a more detailed
description of the F2FS architecture and its advantages/disadvantages in the
form of an article? That would make sense, from my point of view.

With the best regards,
Vyacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-18 13:39                               ` Vyacheslav Dubeyko
@ 2012-10-18 22:14                                 ` Jaegeuk Kim
  2012-10-19  9:20                                 ` NeilBrown
  1 sibling, 0 replies; 154+ messages in thread
From: Jaegeuk Kim @ 2012-10-18 22:14 UTC (permalink / raw)
  To: 'Vyacheslav Dubeyko'
  Cc: 'Namjae Jeon', 'Jaegeuk Kim',
	'Arnd Bergmann', 'David Woodhouse',
	'Luk Czerner', 'Marco Stornelli',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel

> [snip]
> > >
> > > And Would you share ppt or document of f2fs if Korea Linux Forum is finished ?
> > >
> >
> > Here I attached the slides, and LF will also share the slides.
> > Thanks,
> >
> 
> I had hope that slides will have more detailed description. Maybe it is
> good for Linux Forum. But do you plan to publish more detailed
> description of F2FS architecture, advantages/disadvantages in the form
> of article? It makes sense from my point of view.

Of course.
Jooyoung has started writing a paper on f2fs.
I don't know when it will be published, but we have a lot of work to do now. :)
Thanks,

> 
> With the best regards,
> Vyacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
  2012-10-18 13:39                               ` Vyacheslav Dubeyko
  2012-10-18 22:14                                 ` Jaegeuk Kim
@ 2012-10-19  9:20                                 ` NeilBrown
  1 sibling, 0 replies; 154+ messages in thread
From: NeilBrown @ 2012-10-19  9:20 UTC (permalink / raw)
  To: Vyacheslav Dubeyko
  Cc: Jaegeuk Kim, 'Namjae Jeon', 'Jaegeuk Kim',
	'Arnd Bergmann', 'David Woodhouse',
	'Luk Czerner', 'Marco Stornelli',
	'Al Viro',
	tytso, gregkh, linux-kernel, chur.lee, cm224.lee, jooyoung.hwang,
	linux-fsdevel


On Thu, 18 Oct 2012 17:39:11 +0400 Vyacheslav Dubeyko <slava@dubeyko.com>
wrote:

> [snip]
> > > 
> > > And Would you share ppt or document of f2fs if Korea Linux Forum is finished ?
> > > 
> > 
> > Here I attached the slides, and LF will also share the slides.
> > Thanks,
> > 
> 
> I had hope that slides will have more detailed description. Maybe it is
> good for Linux Forum. But do you plan to publish more detailed
> description of F2FS architecture, advantages/disadvantages in the form
> of article? It makes sense from my point of view.

<plug>
https://lwn.net/Articles/518988/
</plug>

:-)

NeilBrown

> 
> With the best regards,
> Vyacheslav Dubeyko.
> 
> 

!


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-11 16:32                               ` 杨苏立 Yang Su Li
  (?)
  (?)
@ 2012-10-23 19:53                               ` Vladislav Bolkhovitin
  2012-10-24 21:17                                   ` Nico Williams
  2012-10-25  5:14                                 ` Theodore Ts'o
  -1 siblings, 2 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-23 19:53 UTC (permalink / raw)
  To: 杨苏立 Yang Su Li
  Cc: General Discussion of SQLite Database, linux-kernel, linux-fsdevel, drh

杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:
> I am not quite sure whether I should ask this question here, but in terms
> of lightweight barrier/fsync, could anyone tell me why the device
> driver / OS provides the barrier interface rather than some other
> abstraction anyway? I am sorry if this sounds like a stupid question
> or it has been discussed before....
>
> I mean, most of the time we only need some ordering of writes; not a
> complete order, but a partial, very simple topological order. And a
> barrier seems to be a heavyweight solution to achieve this anyway:
> you have to finish all writes before the barrier, then start all
> writes issued after the barrier. That is an ordering which is much
> stronger than what we need, isn't it?
>
> As most of the time the order we need does not involve too many blocks
> (certainly a lot less than all the cached blocks in the system or in
> the disk's cache), that topological order isn't likely to be very
> complicated, and I imagine it could be implemented efficiently in a
> modern device, which already has complicated caching/garbage
> collection/whatever going on internally. In particular, it seems not
> too hard to implement it on top of SCSI's ordered/simple task mode?

Yes, SCSI has full support for ordered/simple commands, designed exactly for that
task: to keep a steady flow of commands even when some of them are ordered.
It also has the necessary facilities to handle command errors without unexpected
reordering of subsequent commands (ACA, etc.). Those allow one to get full storage
performance by fully "filling the pipe", to use networking terms. I can easily imagine
real-life configs where it can bring 2+ times more performance than queue
flushing.

In fact, AFAIK, AIX requires storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports naturally
have the link as the point of serialization, so all you need in a multithreaded
environment is to pass some SN from the point when each ORDERED command is created to
the point when it is sent to the link, and make sure that no SIMPLE commands can ever
cross ORDERED commands. You can see how it is implemented in SCST in an elegant
and lockless manner (for SIMPLE commands).

But historically, for some reason, Linux storage developers were stuck with the
"barriers" concept, which is obviously not the same as ORDERED commands, and hence had
a lot of trouble with its ambiguous semantics. As far as I can tell, the reason for
that was a lack of sufficiently deep SCSI understanding (how to handle errors, the
belief that ACA is something legacy from parallel SCSI times, etc.).

Hopefully, eventually the storage developers will realize the value behind ordered
commands and learn the corresponding SCSI facilities to deal with them. It's quite
easy to demonstrate this value if you know where to look and do not blindly
refuse the possibility. I have already tried to explain it a couple of times,
but was not successful.

Before that happens, people will keep returning again and again with those simple
questions: why must the queue be flushed for any ordered operation? Isn't that
obvious overkill?

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-24 21:17                                   ` Nico Williams
  0 siblings, 0 replies; 154+ messages in thread
From: Nico Williams @ 2012-10-24 21:17 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Tue, Oct 23, 2012 at 2:53 PM, Vladislav Bolkhovitin
<vvvvvst@gmail.com> wrote:
>> As most of the time the order we need do not involve too many blocks
>> (certainly a lot less than all the cached blocks in the system or in
>> the disk's cache), that topological order isn't likely to be very
>> complicated, and I image it could be implemented efficiently in a
>> modern device, which already has complicated caching/garbage
>> collection/whatever going on internally. Particularly, it seems not
>> too hard to be implemented on top of SCSI's ordered/simple task mode?

If you have multiple layers involved (e.g., SQLite then the
filesystem, and if the filesystem is spread over multiple storage
devices), and if transactions are not bounded, and on top of that if
there are other concurrent writers to the same filesystem (even if not
the same files) then the set of blocks to write and internal ordering
can get complex.  In practice filesystems try to break these up into
large self-consistent chunks and write those -- ZFS does this, for
example -- and this is aided by the lack of transactional semantics in
the filesystem.

For SQLite with a VFS that talks [i]SCSI directly then things could be
much more manageable as there's only one write transaction in progress
at any given time.  But that's not realistic, except, perhaps, in some
embedded systems.

> Yes, SCSI has full support for ordered/simple commands designed exactly for
> that task: [...]
>
> [...]
>
> But historically for some reason Linux storage developers were stuck with
> "barriers" concept, which is obviously not the same as ORDERED commands,
> hence had a lot troubles with their ambiguous semantic. As far as I can tell
> the reason of that was some lack of sufficiently deep SCSI understanding
> (how to handle errors, believe that ACA is something legacy from parallel
> SCSI times, etc.).

Barriers are a very simple abstraction, so there's that.

> Hopefully, eventually the storage developers will realize the value behind
> ordered commands and learn corresponding SCSI facilities to deal with them.
> It's quite easy to demonstrate this value, if you know where to look at and
> not blindly refusing such possibility. I have already tried to explain it a
> couple of times, but was not successful.

Exposing ordering of lower-layer operations to filesystem applications
is a non-starter.  About the only reasonable thing to do with a
filesystem is add barrier operations.  I know, you're talking about
lower layer capabilities, and SQLite could talk to that layer
directly, but let's face it: it's not likely to.

> Before that happens, people will keep returning again and again with those
> simple questions: why the queue must be flushed for any ordered operation?
> Isn't is an obvious overkill?

That [cache flushing] is not what's being asked for here.  Just a
light-weight barrier.  My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known uberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, e) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed.  This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.
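
(A minimal sketch of points (b) and (c) above, assuming only POSIX
fsync() and C11 atomics; the structure and function names are invented
for illustration and are not from SQLite or any existing VFS:)

#include <stdatomic.h>
#include <unistd.h>

struct db {
    int fd;                        /* COW-format database file          */
    atomic_ulong last_written;     /* newest transaction fully written  */
    atomic_ulong last_durable;     /* newest transaction fsync()ed      */
};

/* Background thread: keep making the newest written transaction durable
 * and publish its id, so later transactions can point at it. */
void *fsync_worker(void *arg)
{
    struct db *db = arg;

    for (;;) {                     /* runs for the life of the process  */
        unsigned long target = atomic_load(&db->last_written);

        if (target == atomic_load(&db->last_durable)) {
            usleep(1000);          /* nothing new to flush              */
            continue;
        }
        if (fsync(db->fd) == 0)
            atomic_store(&db->last_durable, target);
    }
    return NULL;
}

/* Writers stamp each new transaction's root block with the newest id
 * already known durable, then advance last_written when done. */
unsigned long durable_pointer(struct db *db)
{
    return atomic_load(&db->last_durable);
}

void transaction_written(struct db *db, unsigned long txn_id)
{
    atomic_store(&db->last_written, txn_id);
}

One would start fsync_worker with pthread_create(), and on recovery walk
back from the newest readable root block to the transaction it names as
durable.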

Nico
--

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-24 21:17                                   ` Nico Williams
  (?)
@ 2012-10-24 22:03                                   ` david
  2012-10-25  0:20                                       ` Nico Williams
  2012-10-25  5:42                                     ` Theodore Ts'o
  -1 siblings, 2 replies; 154+ messages in thread
From: david @ 2012-10-24 22:03 UTC (permalink / raw)
  To: Nico Williams
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, 24 Oct 2012, Nico Williams wrote:

>> Before that happens, people will keep returning again and again with those
>> simple questions: why the queue must be flushed for any ordered operation?
>> Isn't is an obvious overkill?
>
> That [cache flushing] is not what's being asked for here.  Just a
> light-weight barrier.  My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known ubberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, d) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed.  This obtains ACI- transaction semantics (survives power
> failures but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.

I'm doing some work with rsyslog and its disk-based queues, and there is a
similar issue there. The good news is that we can have a version that is
Linux-specific (rsyslog is used on other OSs, but there is an existing
queue implementation that they can use; if the faster one is Linux-only
but significantly faster, that's just a win for Linux).

Like what is being described for sqlite, losing the tail end of the
messages is not a big problem under normal conditions. But there is a need
to be sure that what is there is complete up to the point where it's lost.

This is similar in concept to the write-ahead logs used by databases (without
the absolute durability requirement).

1. new messages arrive and get added to the end of the queue file.

2. a thread updates the queue to indicate that it is in the process 
of delivering a block of messages

3. the thread updates the queue to indicate that the block of messages has 
been delivered

4. garbage collection happens to delete the old messages to free up space
(if queues go into files, this can just mean limiting the file size by
spilling to multiple files, and deleting an old file once it is completely
marked as delivered)

I am not fully understanding how what you are describing (COW, separate 
fsync threads, etc) would be implemented on top of existing filesystems. 
Most of what you are describing seems like it requires access to the 
underlying storage to implement.

Could you give a more detailed explanation?

David Lang

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-25  0:20                                       ` Nico Williams
  0 siblings, 0 replies; 154+ messages in thread
From: Nico Williams @ 2012-10-25  0:20 UTC (permalink / raw)
  To: david
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, Oct 24, 2012 at 5:03 PM,  <david@lang.hm> wrote:
> I'm doing some work with rsyslog and it's disk-baded queues and there is a
> similar issue there. The good news is that we can have a version that is
> linux specific (rsyslog is used on other OSs, but there is an existing queue
> implementation that they can use, if the faster one is linux-only, but is
> significantly faster, that's just a win for Linux)
>
> Like what is being described for sqlite, loosing the tail end of the
> messages is not a big problem under normal conditions. But there is a need
> to be sure that what is there is complete up to the point where it's lost.
>
> this is similar in concept to write-ahead-logs done for databases (without
> the absolute durability requirement)
>
> [...]
>
> I am not fully understanding how what you are describing (COW, separate
> fsync threads, etc) would be implemented on top of existing filesystems.
> Most of what you are describing seems like it requires access to the
> underlying storage to implement.
>
> could you give a more detailed explination?

COW is "copy on write", which is actually a bit of a misnomer -- all
COW means is that blocks aren't over-written, instead new blocks are
written.  In particular this means that inodes, indirect blocks, data
blocks, and so on, that are changed are actually written to new
locations, and the on-disk format needs to handle this indirection.
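
(To make that indirection concrete, here is a toy sketch -- the on-disk
layout is invented for the example and is not any real filesystem's or
database's format.  A changed block is appended at a fresh offset, and
only then is one of two alternating root slots rewritten to point at it,
so the previous version is never overwritten in place:)

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ 4096

struct root {
    uint64_t generation;       /* highest valid generation wins on open */
    uint64_t data_offset;      /* where the current block copy lives    */
};

/* Assumes the first few blocks of the file were reserved for the two
 * root slots when the file was created. */
int cow_update(int fd, struct root *live, const void *newdata)
{
    off_t end = lseek(fd, 0, SEEK_END);          /* fresh location      */
    struct root next = {
        .generation  = live->generation + 1,
        .data_offset = (uint64_t)end,
    };
    /* Two root slots at fixed offsets; write the one that is not live. */
    off_t slot = (off_t)BLKSZ * (1 + next.generation % 2);

    if (pwrite(fd, newdata, BLKSZ, end) != BLKSZ)
        return -1;
    if (fsync(fd))                               /* data before pointer */
        return -1;
    if (pwrite(fd, &next, sizeof(next), slot) != sizeof(next))
        return -1;
    if (fsync(fd))                               /* pointer now durable */
        return -1;
    *live = next;
    return 0;
}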

As for fsync() and background threads... fsync() is synchronous, but in
this scheme we want it to happen asynchronously and then we want to
update each transaction with a pointer to the last transaction that is
known stable given an fsync()'s return.

Nico
--

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  0:20                                       ` Nico Williams
  (?)
@ 2012-10-25  1:04                                       ` david
  2012-10-25  5:18                                           ` Nico Williams
  -1 siblings, 1 reply; 154+ messages in thread
From: david @ 2012-10-25  1:04 UTC (permalink / raw)
  To: Nico Williams
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, 24 Oct 2012, Nico Williams wrote:

> On Wed, Oct 24, 2012 at 5:03 PM,  <david@lang.hm> wrote:
>> I'm doing some work with rsyslog and it's disk-baded queues and there is a
>> similar issue there. The good news is that we can have a version that is
>> linux specific (rsyslog is used on other OSs, but there is an existing queue
>> implementation that they can use, if the faster one is linux-only, but is
>> significantly faster, that's just a win for Linux)
>>
>> Like what is being described for sqlite, loosing the tail end of the
>> messages is not a big problem under normal conditions. But there is a need
>> to be sure that what is there is complete up to the point where it's lost.
>>
>> this is similar in concept to write-ahead-logs done for databases (without
>> the absolute durability requirement)
>>
>> [...]
>>
>> I am not fully understanding how what you are describing (COW, separate
>> fsync threads, etc) would be implemented on top of existing filesystems.
>> Most of what you are describing seems like it requires access to the
>> underlying storage to implement.
>>
>> could you give a more detailed explination?
>
> COW is "copy on write", which is actually a bit of a misnomer -- all
> COW means is that blocks aren't over-written, instead new blocks are
> written.  In particular this means that inodes, indirect blocks, data
> blocks, and so on, that are changed are actually written to new
> locations, and the on-disk format needs to handle this indirection.

so how can you do this, and keep the writes in order (especially between 
two files) without being the filesystem?

> As for fsyn() and background threads... fsync() is synchronous, but in
> this scheme we want it to happen asynchronously and then we want to
> update each transaction with a pointer to the last transaction that is
> known stable given an fsync()'s return.

If you could specify ordering between two writes, I could see a process 
along the lines of

Append new message to file1

append tiny status updates to file2

every million messages, move to new files. once the last message has been 
processed for the old set of files, delete them.

since file2 is small, you can reconstruct state fairly cheaply

But unless you are a filesystem, how can you make sure that the message 
data is written to file1 before you write the metadata about the message 
to file2?

Right now it seems that there is no way for an application to do this
other than doing an fsync(file1) before writing the metadata to file2.
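
(A sketch of exactly that portable-but-expensive pattern, with invented
helper names; file1 holds the message data and file2 the tiny status
records that refer to it:)

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int log_message(int file1, int file2, const char *msg, unsigned long seq)
{
    char status[64];
    size_t len = strlen(msg);
    int n;

    if (write(file1, msg, len) != (ssize_t)len)
        return -1;
    if (fsync(file1))              /* the expensive ordering point       */
        return -1;

    /* Only now may file2 claim that message <seq> exists in file1. */
    n = snprintf(status, sizeof(status), "%lu received\n", seq);
    if (write(file2, status, n) != n)
        return -1;
    return 0;                      /* fsync(file2) only when the status
                                      record itself must be durable      */
}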

And there is no way for the application to tell the filesystem to write
the data in file2 in order (to make sure that block 3 is not written, only
to have the system crash before block 2 is written), so the application
needs to do frequent fsync(file2) calls.

If you need complete durability of your data, there are well documented 
ways of enforcing it (including the lwn.net article 
http://lwn.net/Articles/457667/ )

But if you don't need the guarantee that your data is on disk now, and just
need it ordered so that if you crash you are guaranteed only to lose data
off the tail of your file, there doesn't seem to be any way to do this
other than using the fsync() hammer and waiting for the overhead of forcing
the data to disk now.


Or, as I type this, it occurs to me that you may be saying that every time 
you want to do an ordering guarantee, spawn a new thread to do the fsync 
and then just keep processing. The fsync will happen at some point, and 
the writes will not be re-ordered across the fsync, but you can keep 
going, writing more data while the fsync's are pending.

Then if you have a filesystem and I/O subsystem that can consolidate the
fsyncs from all the different threads together into one I/O operation
without having to flush the entire I/O queue for each one, you can get
acceptable performance, with ordering. If the system crashes, data that
hasn't had its fsync() complete will be the only thing that is lost.

David Lang

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-23 19:53                               ` Vladislav Bolkhovitin
  2012-10-24 21:17                                   ` Nico Williams
@ 2012-10-25  5:14                                 ` Theodore Ts'o
  2012-10-25 13:03                                   ` Alan Cox
  2012-10-27  1:54                                   ` Vladislav Bolkhovitin
  1 sibling, 2 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-25  5:14 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
> Yes, SCSI has full support for ordered/simple commands designed
> exactly for that task: to have steady flow of commands even in case
> when some of them are ordered.....

SCSI does, yes --- *if* the device actually implements Tagged Command
Queuing (TCQ).  Not all devices do.

More importantly, SATA drives do *not* have this capability, and when
you compare the price of SATA drives to uber-expensive "enterprise
drives", it's not surprising that most people don't actually use
SCSI/SAS drives that have implemented TCQ.  SATA's Native Command
Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.

Yes, you can turn off writeback caching, but that has pretty huge
performance costs; and there is the FUA bit, but that's just an
unconditional high priority bypass of the writeback cache, which is
useful in some cases, but which again, does not give the ability for
the OS to specify a partial order, while letting the drive reorder
other requests for efficiency/performance's sake, since the drive has
a lot more information about the optimal way to reorder requests based
on the current location of the drive head and where certain blocks may
have been remapped due to bad block sparing, etc.

> Hopefully, eventually the storage developers will realize the value
> behind ordered commands and learn corresponding SCSI facilities to
> deal with them.

Eventually, drive manufacturers will realize that trying to price
gouge people who want advanced features such as TCQ and DIF/DIX is the
best way to guarantee that most people won't bother to purchase them,
and hence the features will remain largely unused....

    	      	       	    	   	   - Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-25  5:18                                           ` Nico Williams
  0 siblings, 0 replies; 154+ messages in thread
From: Nico Williams @ 2012-10-25  5:18 UTC (permalink / raw)
  To: david
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, Oct 24, 2012 at 8:04 PM,  <david@lang.hm> wrote:
> On Wed, 24 Oct 2012, Nico Williams wrote:
>> COW is "copy on write", which is actually a bit of a misnomer -- all
>> COW means is that blocks aren't over-written, instead new blocks are
>> written.  In particular this means that inodes, indirect blocks, data
>> blocks, and so on, that are changed are actually written to new
>> locations, and the on-disk format needs to handle this indirection.
>
> so how can you do this, and keep the writes in order (especially between two
> files) without being the filesystem?

By trusting fsync().  And if you don't care about immediate Durability
you can run the fsync() in a background thread and mark the associated
transaction as completed in the next transaction to be written after
the fsync() completes.

>> As for fsyn() and background threads... fsync() is synchronous, but in
>> this scheme we want it to happen asynchronously and then we want to
>> update each transaction with a pointer to the last transaction that is
>> known stable given an fsync()'s return.
>
> If you could specify ordering between two writes, I could see a process
> along the lines of
>
> [...]

fsync() deals with just one file.  fsync()s of different files are
another story.  That said, as long as the format of the two files is
COW then you can still compose transactions involving two files.  The
key is that the file contents themselves must be COW-structured.

Incidentally, here's a single-file, bag of b-trees that uses a COW
format: MDB, which can be found in
git://git.openldap.org/openldap.git, in the mdb.master branch.

> Or, as I type this, it occurs to me that you may be saying that every time
> you want to do an ordering guarantee, spawn a new thread to do the fsync and
> then just keep processing. The fsync will happen at some point, and the
> writes will not be re-ordered across the fsync, but you can keep going,
> writing more data while the fsync's are pending.

Yes, but only if the file's format is COWish.

The point is that COW saves the day.  A file-based DB needs to be COW.
And the filesystem needs to be as well.

Note that write ahead logging approximates COW well enough most of the time.

> Then if you have a filesystem and I/O subsystem that can consolodate the
> fwyncs from all the different threads together into one I/O operation
> without having to flush the entire I/O queue for each one, you can get
> acceptable performance, with ordering. If the system crashes, data that
> hasn't had it's fsync() complete will be the only thing that is lost.

With the above caveat, yes.

Nico
--

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-24 22:03                                   ` [sqlite] " david
  2012-10-25  0:20                                       ` Nico Williams
@ 2012-10-25  5:42                                     ` Theodore Ts'o
  2012-10-25  7:11                                       ` david
  1 sibling, 1 reply; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-25  5:42 UTC (permalink / raw)
  To: david
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, Oct 24, 2012 at 03:03:00PM -0700, david@lang.hm wrote:
> Like what is being described for sqlite, loosing the tail end of the
> messages is not a big problem under normal conditions. But there is
> a need to be sure that what is there is complete up to the point
> where it's lost.
> 
> this is similar in concept to write-ahead-logs done for databases
> (without the absolute durability requirement)

If that's what you require, and you are using ext3/4, using data
journalling might meet your requirements.  It's something you can
enable on a per-file basis, via chattr +j; you don't have to force all
file systems to use data journalling via the data=journal mount
option.

The potential downsides that you may or may not care about for this
particular application:

(a) This will definitely have a performance impact, especially if you
are doing lots of small (less than 4k) writes, since the data blocks
will get run through the journal, and will only later get written to
their final location on disk.

(b) You don't get atomicity if the write spans a 4k block boundary.
All of the bytes before i_size will be written, so you don't have to
worry about "holes"; but the last message written to the log file
might be truncated.

(c) There will be a performance impact, since the contents of data
blocks will be written at least twice (once to the journal, and once
to the final location on disk).  If you do lots of small, sub-4k
writes, the performance might be even worse, since data blocks might
be written multiple times to the journal.
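
(For reference, the same per-file switch can be flipped from a program
instead of via "chattr +j"; a sketch using the FS_IOC_GETFLAGS /
FS_IOC_SETFLAGS ioctls from <linux/fs.h>, with error handling trimmed,
and assuming the file lives on ext3/4:)

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int enable_data_journalling(const char *path)
{
    int flags, ret = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
        flags |= FS_JOURNAL_DATA_FL;       /* the same bit chattr +j sets */
        ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
    }
    close(fd);
    return ret;
}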

						- Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  5:18                                           ` Nico Williams
  (?)
@ 2012-10-25  6:02                                           ` Theodore Ts'o
  2012-10-25  6:58                                             ` david
  2012-10-30 23:49                                             ` [sqlite] " Nico Williams
  -1 siblings, 2 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-25  6:02 UTC (permalink / raw)
  To: Nico Williams
  Cc: david, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
> 
> By trusting fsync().  And if you don't care about immediate Durability
> you can run the fsync() in a background thread and mark the associated
> transaction as completed in the next transaction to be written after
> the fsync() completes.

The challenge is when you have entangled metadata updates.  That is,
you update file A, and file B, and file A and B might share metadata.
In order to sync file A, you also have to update part of the metadata
for the updates to file B, which means calculating the dependencies of
what you have to drag in can get very complicated.  You can keep track
of what bits of the metadata you have to undo and then redo before
writing out the metadata for fsync(A), but that basically means you
have to implement soft updates, and all of the complexity this
implies: http://lwn.net/Articles/339337/

If you can keep all of the metadata separate, this can be somewhat
mitigated, but usually the block allocation records (regardless of
whether you use a tree, or a bitmap, or some other data structure)
tend to have entanglement problems.

It certainly is not impossible; RDBMS's have implemented this.  On the
other hand, they generally aren't as fast as file systems for
non-transactional workloads, and people really care about performance
on those sorts of workloads for file systems.  (About a decade ago,
Oracle tried to claim that you could run file system workloads using
an Oracle databsae as a back-end.  Everyone laughed at them, and the
idea died a quick, merciful death.)

Still, if you want to try to implement such a thing, by all means,
give it a try.  But I think you'll find that creating a file system
that can compete with existing file systems for performance, and
*then* also supports a transactional model, is going to be quite a
challenge.

     	      		      	     	      	 - Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  6:02                                           ` [sqlite] " Theodore Ts'o
@ 2012-10-25  6:58                                             ` david
  2012-10-25 14:03                                                 ` Theodore Ts'o
  2012-10-30 23:49                                             ` [sqlite] " Nico Williams
  1 sibling, 1 reply; 154+ messages in thread
From: david @ 2012-10-25  6:58 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>>
>> By trusting fsync().  And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.
>
> The challenge is when you have entagled metadata updates.  That is,
> you update file A, and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated.  You can keep track
> of what bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, and all of the complexity this
> implies: http://lwn.net/Articles/339337/
>
> If you can keep all of the metadata separate, this can be somewhat
> mitigated, but usually the block allocation records (regardless of
> whether you use a tree, or a bitmap, or some other data structure)
> tends of have entanglement problems.

hmm, two thoughts occur to me.

1. to avoid entanglement, put the two files in separate directories

2. take advantage of entanglement to enforce ordering


thread 1 (repeated): write new message to file 1, spawn new thread to 
fsync

thread 2: write to file 2 that message1-5 are being worked on

thread 2 (later): write to file 2 that messages 1-5 are done

when thread 1 spawns the new thread to do the fsync, the system will be 
forced to write the data to file 2 as of the time it does the fsync.

This should make it so that you never have data written to file2 that 
refers to data that hasn't been written to file1 yet.


> It certainly is not impossible; RDBMS's have implemented this.  On the
> other hand, they generally aren't as fast as file systems for
> non-transactional workloads, and people really care about performance
> on those sorts of workloads for file systems.

the RDBMSs have implemented stronger guarantees than what we need

A few years ago I was investigating this for logging. With the reliable
(RDBMS-style), but inefficient, disk queue that rsyslog has, writing to a
high-end Fusion-io SSD, ext2 resulted in ~8K logs/sec, ext3 resulted in ~2K
logs/sec, and JFS/XFS resulted in ~4K logs/sec (ext4 wasn't considered
stable enough at the time to be tested).

> Still, if you want to try to implement such a thing, by all means,
> give it a try.  But I think you'll find that creating a file system
> that can compete with existing file systems for performance, and
> *then* also supports a transactional model, is going to be quite a
> challenge.

The question is trying to figure out a way to get ordering right with existing
filesystems (preferably without using something too tied to a single
filesystem implementation), not to try to create a new one.

The frustrating thing is that when people point out how things like sqlite
are so horribly slow, the reply seems to be "well, that's what you get for
doing so many fsyncs, don't do that", whereas when there is a 'problem' like
the KDE "config loss" problem a few years ago, the response is "well, that's
what you get for not doing fsync".

Both responses are correct, from a purely technical point of view.

But what's missing is any way to get the result of ordered I/O that will
let you do something pretty fast, but with the guarantee that, if you
lose data in a crash, the only loss you are risking is that your most
recent data may be missing (either for one file, or across multiple files
if that's what it takes).

Since this topic came up again, I figured I'd poke a bit and try to either 
get educated on how to do this "right" or try and see if there's something 
that could be added to the kernel to make it possible for userspace 
programs to do this.

What I think userspace really needs is something like a barrier function 
call. "for this fd, don't re-order writes as they go down through the 
stack"

If the hardware is going to reorder things once it hits the hardware, this 
is going to hurt performance (how much depends on a lot of stuff)

but the filesystems are able to make their journals work, so there should 
be some way to let userspace do some sort of similar ordering

David Lang

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  5:42                                     ` Theodore Ts'o
@ 2012-10-25  7:11                                       ` david
  0 siblings, 0 replies; 154+ messages in thread
From: david @ 2012-10-25  7:11 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> On Wed, Oct 24, 2012 at 03:03:00PM -0700, david@lang.hm wrote:
>> Like what is being described for sqlite, loosing the tail end of the
>> messages is not a big problem under normal conditions. But there is
>> a need to be sure that what is there is complete up to the point
>> where it's lost.
>>
>> this is similar in concept to write-ahead-logs done for databases
>> (without the absolute durability requirement)
>
> If that's what you require, and you are using ext3/4, usng data
> journalling might meet your requirements.  It's something you can
> enable on a per-file basis, via chattr +j; you don't have to force all
> file systems to use data journaling via the data=journalled mount
> option.
>
> The potential downsides that you may or may not care about for this
> particular application:
>
> (a) This will definitely have a performance impact, especially if you
> are doing lots of small (less than 4k) writes, since the data blocks
> will get run through the journal, and will only get written to their
> final location on disk.
>
> (b) You don't get atomicity if the write spans a 4k block boundary.
> All of the bytes before i_size will be written, so you don't have to
> worry about "holes"; but the last message written to the log file
> might be truncated.
>
> (c) There will be a performance impact, since the contents of data
> blocks will be written at least twice (once to the journal, and once
> to the final location on disk).  If you do lots of small, sub-4k
> writes, the performance might be even worse, since data blocks might
> be written multiple times to the journal.

I'll have to dig into this option. In the case of rsyslog it sounds
like it could work (not as good as a filesystem-independent way of doing
things, but better than full fsyncs).

Truncated messages are not great, but they are a detectable, and 
acceptable risk.

while the average message size is much smaller than 4K (on my network it's 
~250 bytes), the metadata that's broken out expands this somewhat, and we 
can afford to waste disk space if it makes things safer or more efficient.

If we do update-in-place with flags for each message, each message will
need to be written up to three times (on receipt, when being processed, and
when finished processing). With high message burst rates, I'm worried that we
would fill up the journal; is there a good way to deal with this?

I believe that ext4 can put the journal on a different device from the
filesystem; would this help a lot?

If you were to put the journal for an ext4 filesystem on a RAM disk, you
would lose the data recovery protection of the journal, but could you use
this trick to get ordered data writes onto the filesystem?

David Lang

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  5:14                                 ` Theodore Ts'o
@ 2012-10-25 13:03                                   ` Alan Cox
  2012-10-25 13:50                                       ` Theodore Ts'o
  2012-10-27  1:54                                   ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 154+ messages in thread
From: Alan Cox @ 2012-10-25 13:03 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Vladislav Bolkhovitin, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

> > Hopefully, eventually the storage developers will realize the value
> > behind ordered commands and learn corresponding SCSI facilities to
> > deal with them.
> 
> Eventually, drive manufacturers will realize that trying to price
> guage people who want advanced features such as TCQ, DIF/DIX, is the
> best way to gaurantee that most people won't bother to purchase them,
> and hence the features will remain largely unused....

I doubt they care. The profit on high end features from the people who
really need them I would bet far exceeds any other benefit of giving it to
others. Welcome to capitalism 8)

Plus - spinning rust for those end users is on the way out, SATA-to-flash
is a bit of a hack, and people are already putting a lot of focus onto
things like NVM Express.

Alan

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-25 13:50                                       ` Theodore Ts'o
  0 siblings, 0 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-25 13:50 UTC (permalink / raw)
  To: Alan Cox
  Cc: Vladislav Bolkhovitin, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

On Thu, Oct 25, 2012 at 02:03:25PM +0100, Alan Cox wrote:
> 
> I doubt they care. The profit on high end features from the people who
> really need them I would bet far exceeds any other benefit of giving it to
> others. Welcome to capitalism 8)

Yes, but it's a question of pricing.  If they had priced it just a
wee bit higher, then there would have been an incentive to add support
for TCQ so it could actually be used in various Linux file systems,
since there would have been lots of users of it.  But as it is, the
folks who are purchasing huge, vast numbers of these drives --- such as
the large cloud providers: Amazon, Facebook, Rackspace, et al. ---
will choose to purchase large numbers of commodity drives, and then
find ways to work around the missing functionality in userspace.  For
example, DIF/DIX would be nice, and if it were available for cheap, I
could imagine it being used.  But you can accomplish the same thing in
userspace, and in fact at Google I've implemented a special
not-for-mainline patch which spikes out stable writes (required for
DIF/DIX) because it has significant performance overhead, and DIF/DIX
has zero benefit if you're not willing to shell out $$$ for hardware
that supports it.

Maybe the HDD manufacturers have been able to price gouge a small
number of enterprise I/T shops with more dollars than sense, but
personally, I'm not convinced they picked an optimal pricing
strategy....

Put another way, I accept that Toyota should price a Lexus ES more
than a Camry, but if it's priced at say, 3x the price of a Camry
instead of 20%, they might find that precious few people are willing
to pay that kind of money for what is essentially the same car with
minor luxury tweaks added to it.

> Plus - spinning rust for those end users is on the way out, SATA to flash
> is a bit of hack and people are already putting a lot of focus onto
> things like NVM Express.

Yeah....  I don't buy that.  One, flash is still too expensive.  Two,
the capital costs to build enough silicon foundries to replace the
current production volume of HDDs are way too high for any
company to afford (the cloud providers are buying *huge* numbers of
HDDs) --- and that's assuming companies wouldn't choose to use those
foundries for products with larger margins --- such as, for example,
CPU/GPU chips. :-) And third and finally, if you study the long-term
trends in terms of Data Retention Time (going down), Program and Read
Disturb (going up), and Write Endurance (going down) as a function of
feature size and/or time, you'd be wise to treat flash as nothing more
than short-term cache, and not as a long term stable store.

If end users completely give up on spinning rust, and store all of their
precious family pictures on flash storage, after a couple of years,
they are likely going to be very disappointed....

Speaking personally, I wouldn't want to have anything on flash for
more than a few months at *most* before I made sure I had another copy
saved on spinning rust platters for long-term retention.

      	 	       		    	      - Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-25 14:03                                                 ` Theodore Ts'o
  0 siblings, 0 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-25 14:03 UTC (permalink / raw)
  To: david
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Wed, Oct 24, 2012 at 11:58:49PM -0700, david@lang.hm wrote:
> The frustrating thing is that when people point out how things like
> sqlite are so horribly slow, the reply seems to be "well, that's
> what you get for doing so many fsyncs, don't do that", when there is
> a 'problem' like the KDE "config loss" problem a few years ago, the
> response is "well, that's what you get for not doing fsync"

Sure... but the answer is to only do the fsync's when you need to.
For example, if GNOME and KDE are rewriting the entire registry file
each time the application changes a single registry key, sure, if
you rewrite the entire registry file, and then fsync after each
rewrite before you replace the file, you will be safe.  And if the
application needs to update dozens or hundreds of registry keys (or
every time the window gets moved or resized), then yes, it will be
slow.  But the application didn't have to do that!  It could have
updated all the registry keys in memory, and then update the registry
file periodically instead.
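
A minimal sketch of that pattern (purely illustrative; not any desktop
framework's actual code, and the helper name is made up): keep the
settings in memory, and have a periodic flush write a temporary file,
fsync() it, and rename() it over the old one.

/* Sketch only: flush an in-memory settings buffer to "path" atomically.
 * Called from a timer or idle handler, not on every key change. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int flush_settings(const char *path, const char *buf, size_t len)
{
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);
        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                unlink(tmp);
                return -1;
        }
        close(fd);
        /* rename() replaces the old file atomically, so a crash leaves
         * either the complete old copy or the complete new one. */
        return rename(tmp, path);
}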

Similarly, Firefox didn't need to do a sqlite commit after every
single time its history file was written, causing a third of a
megabyte of write traffic each time you clicked on a web page.  It
could have batched its updates to the history file, since most of the
time, you don't care about making sure the web history is written to
stable store before you're allowed to click on a web page and visit
the next web page.

Or does rsyslog *really* need to issue an fsync after each log
message?  Or could it batch updates so that every N seconds, it
flushes writes to the disk?
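
A minimal sketch of that kind of batching (illustrative only, not
rsyslog's actual code): writers just append to the log fd, and one
background thread issues the fsync() every N seconds.

#include <pthread.h>
#include <unistd.h>

struct flusher {
        int fd;             /* log file descriptor     */
        unsigned interval;  /* seconds between flushes */
};

static void *flush_thread(void *arg)
{
        struct flusher *f = arg;

        for (;;) {
                sleep(f->interval);
                fsync(f->fd);   /* one flush covers every message
                                 * appended since the previous pass */
        }
        return NULL;
}

/* Start it once:  pthread_create(&tid, NULL, flush_thread, &f); */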

(And this is a problem with most Android applications as well.
Apparently the framework API's are such that it's easier for an
application to treat each sqlite statement as an atomic update, so
many/most application writers don't use explicit transaction
boundaries, so updates don't get batched even though it would be more
efficient if they did so.)
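
With the public sqlite3 C API, the fix is just explicit transaction
boundaries; a sketch (the table and values are made up, error handling
omitted):

#include <sqlite3.h>

static void save_prefs(sqlite3 *db)
{
        /* One BEGIN/COMMIT pair means one journal commit (and one set of
         * fsync()s) for the whole batch, instead of one per statement. */
        sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
        sqlite3_exec(db, "UPDATE prefs SET value='640x480' WHERE key='geometry'",
                     NULL, NULL, NULL);
        sqlite3_exec(db, "UPDATE prefs SET value='1' WHERE key='maximized'",
                     NULL, NULL, NULL);
        /* ... more statements ... */
        sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
}

Each statement run outside a transaction is otherwise its own implicit
transaction, with its own commit.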

Sometimes, the answer is not to try to create exotic database-like
functionality in the file system --- the answer is to be more
intelligent at the application layer.  Not only will the application
be more portable, it will also in the end be more efficient, since
even with the most exotic database technologies, the most efficient
transactional commit is the unneeded commit that you optimize away at
the application layer.

						- Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-25 18:03                                                   ` david-gFPdbfVZQbY
  0 siblings, 0 replies; 154+ messages in thread
From: david @ 2012-10-25 18:03 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, 25 Oct 2012, Theodore Ts'o wrote:

> Or does rsyslog *really* need to issue an fsync after each log
> message?  Or could it batch updates so that every N seconds, it
> flushes writes to the disk?

In part this depends on how paranoid the admin is. By default rsyslog 
doesn't do fsyncs, but admins can configure it to do so and can configure 
the batch size.

However, what I'm talking about here is not normal message traffic; it's
the case where the admin has decided that they don't want to use the
normal in-memory queues, they want to have the queues be on disk so that if
the system crashes the queued data will still be there to be processed
after the crash. (In addition, this can be used to cover cases where you
want queue sizes larger than your available RAM.)

In this case, the extreme, and only at the explicit direction of the 
admin, is to fsync after every message.

The norm is that it's acceptable to lose the last few messages, but
losing a chunk out of the middle of the queue file can cause a whole lot
more to be lost, passing the threshold of what is acceptable.

> Sometimes, the answer is not to try to create exotic database like
> functionality in the file system --- the answer is to be more
> intelligent at the application leyer.  Not only will the application
> be more portable, it will also in the end be more efficient, since
> even with the most exotic database technologies, the most efficient
> transactional commit is the unneeded commit that you optimize away at
> the application layer.

I agree, this is why I'm trying to figure out the recommended way to do 
this without needing to do full commits.

Since in most cases it's acceptable to lose the last few chunks written,
if we had some way of specifying ordering, without having to specify 
"write this NOW", the solution would be pretty obvious.

David Lang

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-25 18:29                                                     ` Theodore Ts'o
  0 siblings, 0 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-25 18:29 UTC (permalink / raw)
  To: david
  Cc: Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Thu, Oct 25, 2012 at 11:03:13AM -0700, david@lang.hm wrote:
> I agree, this is why I'm trying to figure out the recommended way to
> do this without needing to do full commits.
> 
> Since in most cases it's acceptable to lose the last few chunks
> written, if we had some way of specifying ordering, without having
> to specify "write this NOW", the solution would be pretty obvious.

Well, using data journalling with ext3/4 may do what you want.  If you
don't do any fsync, the changes will get written every 5 seconds when
the automatic journal sync happens (and sub-4k writes will also get
coalesced to a 5 second granularity).  Even with plain text files,
it's pretty easy to tell whether or not the final record was
partially written after a crash; just look for a trailing
newline.
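
The recovery check is correspondingly simple; a sketch (hypothetical
helper, not from any particular program) that truncates a record file
back to its last complete, newline-terminated record after a crash:

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static int trim_partial_record(const char *path)
{
        char c;
        off_t end, pos;
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;
        end = lseek(fd, 0, SEEK_END);
        for (pos = end; pos > 0; pos--) {
                if (pread(fd, &c, 1, pos - 1) == 1 && c == '\n')
                        break;          /* pos is just past the last newline */
        }
        if (pos != end)
                ftruncate(fd, pos);     /* drop the partly written tail */
        close(fd);
        return 0;
}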

Better yet, if you are writing to multiple log files with data
journalling, all of the writes will happen at the same time, and they
will be streamed to the file system journal, minimizing random writes
for at least the journal writes.

						- Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-24 21:17                                   ` Nico Williams
  (?)
  (?)
@ 2012-10-27  1:52                                   ` Vladislav Bolkhovitin
  -1 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27  1:52 UTC (permalink / raw)
  To: Nico Williams
  Cc: General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh


Nico Williams, on 10/24/2012 05:17 PM wrote:
>> Yes, SCSI has full support for ordered/simple commands designed exactly for
>> that task: [...]
>>
>> [...]
>>
>> But historically for some reason Linux storage developers were stuck with
>> "barriers" concept, which is obviously not the same as ORDERED commands,
>> hence had a lot troubles with their ambiguous semantic. As far as I can tell
>> the reason of that was some lack of sufficiently deep SCSI understanding
>> (how to handle errors, believe that ACA is something legacy from parallel
>> SCSI times, etc.).
>
> Barriers are a very simple abstraction, so there's that.

It isn't simple at all. If you think for some time about barriers from the storage 
point of view, you will soon realize how bad and ambiguous they are.

>> Before that happens, people will keep returning again and again with those
>> simple questions: why the queue must be flushed for any ordered operation?
>> Isn't is an obvious overkill?
>
> That [cache flushing]

It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if 
you like.

Often there's a big difference where it's done: on the system side, or on the 
storage side.

Actually, performance improvements from NCQ in many cases come not because it 
allows the drive to reorder requests, as is commonly thought, but because it 
allows the drive's internal processing stages to stay busy without any 
idle time. Drives often have a long internal pipeline, hence the need to keep 
every stage of it always busy, and hence why using ORDERED commands is important 
for performance.

> is not what's being asked for here. Just a
> light-weight barrier.  My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known uberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, d) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed.  This obtains ACI- transaction semantics (survives power
> failures but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.

I believe what you really want is to be able to send to the storage a sequence of 
your favorite operations (FS operations, async IO operations, etc.) like:

Write back caching disabled:

data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...

Write back caching enabled:

data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21, 
..., data op2M, ...

Right?

(ORDERED means that it is guaranteed that this ordered command will never, under 
any circumstances, be executed before all previous commands have completed, nor 
after any subsequent command has completed.)

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  5:14                                 ` Theodore Ts'o
  2012-10-25 13:03                                   ` Alan Cox
@ 2012-10-27  1:54                                   ` Vladislav Bolkhovitin
  2012-10-27  4:44                                       ` Theodore Ts'o
       [not found]                                     ` <508B3EED.2080003-d+Crzxg7Rs0@public.gmane.org>
  1 sibling, 2 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27  1:54 UTC (permalink / raw)
  To: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh


Theodore Ts'o, on 10/25/2012 01:14 AM wrote:
> On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
>> Yes, SCSI has full support for ordered/simple commands designed
>> exactly for that task: to have steady flow of commands even in case
>> when some of them are ordered.....
>
> SCSI does, yes --- *if* the device actually implements Tagged Command
> Queuing (TCQ).  Not all devices do.
>
> More importantly, SATA drives do *not* have this capability, and when
> you compare the price of SATA drives to uber-expensive "enterprise
> drives", it's not surprising that most people don't actually use
> SCSI/SAS drives that have implemented TCQ.

What differs in our positions is that you are considering storage as something 
you can connect to your desktop, while in my view storage is something which 
stores data and serves it in the best possible way with the best performance.

Hence, for you the least common denominator of all storage features is the most 
important thing, while for me getting the best of what is possible from storage 
is the most important thing.

In my view storage should offload from the host system as much as possible: data 
movements, ordered operations requirements, atomic operations, deduplication, 
snapshots, reliability measures (e.g. RAIDs), load balancing, etc.

It's the same as with 2D/3D video acceleration hardware. If you want the best 
performance from your system, you should offload from it as much as possible. In 
case of video - to the video hardware, in case of storage - to the storage. The 
same as with video, for storage better offload means better performance. At hundreds 
of thousands of IOPS it's clearly visible.

Price doesn't matter here, because it's completely different topic.

> SATA's Native Command
> Queuing (NCQ) is not equivalent; this allows the drive to reorder
> requests (in particular read requests) so they can be serviced more
> efficiently, but it does *not* allow the OS to specify a partial,
> relative ordering of requests.

And so? If SATA can't do it, does that mean nobody else can do it either? I know 
plenty of non-SATA devices which can meet the ordering requirements you need.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25 13:50                                       ` Theodore Ts'o
  (?)
@ 2012-10-27  1:55                                       ` Vladislav Bolkhovitin
  -1 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-27  1:55 UTC (permalink / raw)
  To: Theodore Ts'o, Alan Cox, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh


Theodore Ts'o, on 10/25/2012 09:50 AM wrote:
> Yeah....  I don't buy that.  One, flash is still too expensive.  Two,
> the capital costs to build enough Silicon foundries to replace the
> current production volume of HDD's is way too expensive for any
> company to afford (the cloud providers are buying *huge* numbers of
> HDD's) --- and that's assuming companies wouldn't choose to use those
> foundries for products with larger margins --- such as, for example,
> CPU/GPU chips. :-) And third and finally, if you study the long-term
> trends in terms of Data Retention Time (going down), Program and Read
> Disturb (going up), and Write Endurance (going down) as a function of
> feature size and/or time, you'd be wise to treat flash as nothing more
> than short-term cache, and not as a long term stable store.
>
> If end users completely give up on spinning rust, and store all of their
> precious family pictures on flash storage, after a couple of years,
> they are likely going to be very disappointed....
>
> Speaking personally, I wouldn't want to have anything on flash for
> more than a few months at *most* before I made sure I had another copy
> saved on spinning rust platters for long-term retention.

Here I agree with you.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]                                     ` <508B3EED.2080003-d+Crzxg7Rs0@public.gmane.org>
@ 2012-10-27  4:44                                       ` Theodore Ts'o
  0 siblings, 0 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-10-27  4:44 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
> What differs in our positions is that you are considering storage
> as something you can connect to your desktop, while in my view
> storage is something, which stores data and serves them the best
> possible way with the best performance.

I don't get paid to make the Linux storage stack work well for
gold-plated storage, and as far as I know, none of the purveyors of
said gold-plated storage systems are currently employing Linux file
system developers to make Linux file systems work well on said
gold-plated hardware.

As for what I might do on my own time, for fun, I can't afford said
gold-plated hardware, and personally I get a lot more satisfaction if
I know there will be a large number of people who benefit from my work
(it was really cool when I found out that millions and millions of
Android devices were going to be using ext4 :-), as opposed to a very
small number of people who have paid $$$ to storage vendors who don't
feel it's worthwhile to pay core Linux file system developers to
leverage their hardware.  Earlier, you were bemoaning why Linux file
system developers weren't paying attention to using said fancy SCSI
features.  Perhaps now you'll understand better why it's not happening?

> Price doesn't matter here, because it's completely different topic.

It matters if you think I'm going to do it on my own time, out of my
own budget.  And if you think my employer is going to choose to use
said hardware, price definitely matters.  I consider engineering to be
the art of making tradeoffs, and price is absolutely one of the things
that we need to trade off against other goals.

It's rare that you get to design something where performance matters
above all else.  Maybe it's that way if you're paid by folks whose job
it is to destabilize the world's financial markets by pushing the poles
into the right half plane (i.e., high frequency trading :-).  But for
the rest of the world, price absolutely matters.

     	    	   	 	    - Ted

P.S.  All of the storage I have access to at home is SATA.  If someone
would like to change that and ship me free hardware, as long as it
doesn't require three-phase power (or require some exotic interconnect
which is ghastly expensive and which you are also not going to provide
me for free), do contact me off-line.  :-)

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-27  4:44                                       ` Theodore Ts'o
  (?)
@ 2012-10-30 22:22                                       ` Vladislav Bolkhovitin
  2012-10-31  9:54                                           ` Alan Cox
  -1 siblings, 1 reply; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-10-30 22:22 UTC (permalink / raw)
  To: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh


Theodore Ts'o, on 10/27/2012 12:44 AM wrote:
> On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
>> What differs in our positions is that you are considering storage
>> as something you can connect to your desktop, while in my view
>> storage is something, which stores data and serves them the best
>> possible way with the best performance.
>
> I don't get paid to make Linux storage work well for gold-plated
> storage, and as far as I know, none of the purveyors of said gold
> plated software systems are currently employing Linux file system
> developers to make Linux file systems work well on said gold-plated
> hardware.

I don't want to flame on this topic, but you are not right here. As far as I can 
see, a big chunk of Linux storage and file system developers are/were employed by 
the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.

You know, RedHat has also recently stepped into this market; at least I saw 
their advertisement at SDC 2012. So, you can add all RedHat employees here.

> As for what I might do on my own time, for fun, I can't afford said
> gold-plated hardware, and personally I get a lot more satisfaction if
> I know there will be a large number of people who benefit from my work
> (it was really cool when I found out that millions and millions of
> Android devices were going to be using ext4 :-), as opposed to a very
> small number of people who have paid $$$ to storage vendors who don't
> feel it's worthwhile to pay core Linux file system developers to
> leverage their hardware.  Earlier, you were bemoaning why Linux file
> system developers weren't paying attention to using said fancy SCSI
> features.  Perhaps now you'll understand better it's not happening?
>
>> Price doesn't matter here, because it's completely different topic.
>
> It matters if you think I'm going to do it on my own time, out of my
> own budget.  And if you think my employer is going to choose to use
> said hardware, price definitely matters.  I consider engineering to be
> the art of making tradeoffs, and price is absolutely one of the things
> that we need to trade off against other goals.
>
> It's rare that you get to design something where performance matters
> above all else.  Maybe it's that way if you're paid by folks whose job
> it is to destablize the world's financial markets by pushing the holes
> into the right half plane (i.e., high frequency trading :-).  But for
> the rest of the world, price absolutely matters.

I fully understand your position. But "affordable" and "useful" are completely 
orthogonal things. The "high end" features are very useful if you want to get 
high performance. Those who can afford them will use them, which might be 
your favorite bank, for instance, hence they will indirectly be working for you.

Of course, you don't have to work on those features, especially for free, but you 
similarly don't then have to call them useless only because they are not 
affordable enough to be put in a desktop [1].

Our discussion started not from "value-for-money", but from a constant demand to 
perform ordered commands without full queue draining, which has been ignored by the 
Linux storage developers for YEARS as not useful, right?

Vlad

[1] If you or somebody else wants to put something supporting all the necessary 
features to perform ORDERED commands, including ACA, in a desktop, you can look at 
modern SAS SSDs. I can't call the price of those devices "high-end".



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25  6:02                                           ` [sqlite] " Theodore Ts'o
  2012-10-25  6:58                                             ` david
@ 2012-10-30 23:49                                             ` Nico Williams
  1 sibling, 0 replies; 154+ messages in thread
From: Nico Williams @ 2012-10-30 23:49 UTC (permalink / raw)
  To: Theodore Ts'o, Nico Williams, david,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel

[Dropping sqlite-users.  Note that I'm not subscribed to any of the
other lists cc'ed.]

On Thu, Oct 25, 2012 at 1:02 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>>
>> By trusting fsync().  And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.

You are all missing some context which I would have added had I
noticed the cc'ing of additional lists.

D.R. Hipp asked for a light-weight barrier API from the OS/filesystem,
the SQLite use-case being to implement fast ACI_ semantics, without
durability (i.e., that it be OK to lose the last few transactions, but
not to end up with a corrupt DB, while maintaining atomicity,
consistency, and isolation).

I noted that with a journalled/COW DB file format[0] one could run an
fsync() in a "background" thread to act as a barrier, and then note in
each transaction the last preceding transaction known to have reached
disk (because fsync() returned and the bg thread marked the
transaction in question as durable).  Then refrain from garbage
collecting any transactions not marked as durable.  Now, there are
some caveats, the main one being that this fails if the filesystem or
hardware lie about fsync() / cache flushes.  Other caveats include
that fsync() used this way can have more impact on filesystem
performance than a true light-weight barrier[1], that the filesystem
itself might not be powerfail-safe, and maybe a few others.  But the
point is that fsync() can be used in such a way that one need not wait
for a transaction to reach rotating rust stably and still retain
powerfail safety without durability for the last few transactions.

[0] Like the BSD4.4 log structured filesystem, ZFS, Howard Chu's MDB,
and many others.  Note that ZFS has a pool-import time option to
recover from power failures by ignoring any not completely verifiable
transactions and rolling back to the last verifiable one.

[1] Think of what ZFS does when there's no ZIL and an fsync() comes
along: ZFS will either block the fsync() thread until the current
transaction closes or else close the current transaction and possibly
write a much smaller transaction, thus losing out on making writes as
large and contiguous as possible.
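
A rough sketch of the scheme described above (the structure and names
here are invented purely for illustration; this is nobody's actual
implementation): writers bump last_written, a background thread fsync()s
and advances last_durable, and each new transaction root records the
current last_durable so recovery never has to trust anything newer.

#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

struct db {
        int fd;
        atomic_long last_written;   /* newest transaction appended      */
        atomic_long last_durable;   /* newest transaction known on disk */
};

static void *durability_thread(void *arg)
{
        struct db *db = arg;

        for (;;) {
                long target = atomic_load(&db->last_written);

                if (target > atomic_load(&db->last_durable)) {
                        fsync(db->fd);                  /* the "barrier" */
                        atomic_store(&db->last_durable, target);
                }
                usleep(100 * 1000);
        }
        return NULL;
}

/* Writers: append a COW transaction whose root records
 * atomic_load(&db->last_durable), bump last_written, and never reclaim
 * space belonging to transactions newer than last_durable. */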

> The challenge is when you have entangled metadata updates.  That is,
> you update file A, and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated.  You can keep track
> of what bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, and all of the complexity this
> implies: http://lwn.net/Articles/339337/

I believe that my suggestion composes for multi-file DB file formats,
as long as the sum total forms a COWish on-disk format.  Of course,
adding more fsync()s, even if run in bg threads, may impact system
performance even more (see above).  Also, if one has a COWish DB then
why use more than one file?  If the answer were "to spread contents
across devices" one might ask "why not trust the filesystem/volume
manager to do that?", but hey.

I'm not actually proposing that people try to compose this ACI_
technique though...

Nico
--

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-10-31  9:54                                           ` Alan Cox
  0 siblings, 0 replies; 154+ messages in thread
From: Alan Cox @ 2012-10-31  9:54 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

> I don't want to flame on this topic, but you are not right here. As far as I can 
> see, a big chunk of Linux storage and file system developers are/were employed by 
> the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.
> 
> You know, RedHat from recent times also stepped to this market, at least I saw 
> their advertisement on SDC 2012. So, you can add here all RedHat employees.

Booleans generally should be reserved for logic operators. Most of the
Linux companies work on both low and high end storage. The two are not
mutually exclusive, nor do they divide neatly by market. Many big clouds
use cheap low end drives by the crate, and some high end desktops are using
SAS, although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
not sure personally there is much point.

(and I used to have fibrechannel on my Thinkpad 600 when docked 8))

> Our discussion started not from "value-for-money", but from a constant demand to 
> perform ordered commands without full queue draining, which is ignored by the 
> Linux storage developers for YEARS as not useful, right?

Send patches with benchmarks demonstrating it is useful. It's really
quite simple. Code talks.

Alan

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-31  9:54                                           ` Alan Cox
  (?)
@ 2012-11-01 20:18                                           ` Vladislav Bolkhovitin
  2012-11-01 21:24                                               ` Alan Cox
  -1 siblings, 1 reply; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-01 20:18 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh


Alan Cox, on 10/31/2012 05:54 AM wrote:
>> I don't want to flame on this topic, but you are not right here. As far as I can
>> see, a big chunk of Linux storage and file system developers are/were employed by
>> the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.
>>
>> You know, RedHat from recent times also stepped to this market, at least I saw
>> their advertisement on SDC 2012. So, you can add here all RedHat employees.
>
> Booleans generally should be reserved for logic operators. Most of the
> Linux companies work on both low and high end storage. The two are not
> mutually exclusive nor do they divide neatly by market. Many big clouds
> use cheap low end drives by the crate, some high end desktops are using
> SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
> not sure personally there is much point

That doesn't contradict the point that high performance storage vendors are also 
funding Linux kernel storage development.

> Send patches with benchmarks demonstrating it is useful. It's really
> quite simple. Code talks.

How about the fact that the preliminary infrastructure to send ORDERED commands 
instead of queue draining was recently deleted from the kernel, because "there's no 
difference where to drain the queue, on the kernel or the storage side"?

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-01 21:24                                               ` Alan Cox
  0 siblings, 0 replies; 154+ messages in thread
From: Alan Cox @ 2012-11-01 21:24 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Theodore Ts'o, 杨苏立 Yang Su Li,
	General Discussion of SQLite Database, linux-kernel,
	linux-fsdevel, drh

> How about that recently preliminary infrastructure to send ORDERED commands 
> instead of queue draining was deleted from the kernel, because "there's no 
> difference where to drain the queue, on the kernel or the storage side"?

Send patches.

Alan

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-01 21:24                                               ` Alan Cox
  (?)
@ 2012-11-02  0:15                                               ` Vladislav Bolkhovitin
  -1 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-02  0:15 UTC (permalink / raw)
  To: Alan Cox
  Cc: Theodore Ts'o, 杨苏立 Yang Su Li,
	linux-kernel, linux-fsdevel, drh


Alan Cox, on 11/01/2012 05:24 PM wrote:
>> How about that recently preliminary infrastructure to send ORDERED commands
>> instead of queue draining was deleted from the kernel, because "there's no
>> difference where to drain the queue, on the kernel or the storage side"?
>
> Send patches.

OK, then we have good progress!

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-02  0:38                                                 ` Howard Chu
  0 siblings, 0 replies; 154+ messages in thread
From: Howard Chu @ 2012-11-02  0:38 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Alan Cox, Vladislav Bolkhovitin, Theodore Ts'o, drh,
	linux-kernel, linux-fsdevel

Alan Cox wrote:
>> How about that recently preliminary infrastructure to send ORDERED commands
>> instead of queue draining was deleted from the kernel, because "there's no
>> difference where to drain the queue, on the kernel or the storage side"?
>
> Send patches.

Isn't any type of kernel-side ordering an exercise in futility, since
   a) the kernel has no knowledge of the disk's actual geometry
   b) most drives will internally re-order requests anyway
   c) cheap drives won't support barriers

Even assuming the drives honored all your requests without lying, how would 
you really want this behavior exposed? From the userland perspective, there 
are very few apps that care. Probably only transactional databases, really.

As a DB author, I'm not sure I'd be keen on this as an open() or fcntl() 
option. Databases that really care would be on dedicated filesystems and/or 
devices, so per-file control would be tedious. You would most likely want to 
say "all writes to this string of devices should be order-preserving" and 
forget about it. With that guarantee, a careful writer can have perfectly 
intact data structures all the time, without ever slowing down for an fsync.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: light weight write barriers
       [not found]                                                 ` <50931601.4060102-aQkYFu9vm6AAvxtiuMwx3w@public.gmane.org>
@ 2012-11-02 12:24                                                   ` Richard Hipp
  2012-11-13  3:41                                                     ` [sqlite] " Vladislav Bolkhovitin
  0 siblings, 1 reply; 154+ messages in thread
From: Richard Hipp @ 2012-11-02 12:24 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Vladislav Bolkhovitin, drh-X1OJI8nnyKUAvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Theodore Ts'o,
	Alan Cox

On Thu, Nov 1, 2012 at 8:38 PM, Howard Chu <hyc-aQkYFu9vm6AAvxtiuMwx3w@public.gmane.org> wrote:

> Alan Cox wrote:
>
>> How about that recently preliminary infrastructure to send ORDERED
>>> commands
>>> instead of queue draining was deleted from the kernel, because "there's
>>> no
>>> difference where to drain the queue, on the kernel or the storage side"?
>>>
>>
>> Send patches.
>>
>
> Isn't any type of kernel-side ordering an exercise in futility, since
>   a) the kernel has no knowledge of the disk's actual geometry
>   b) most drives will internally re-order requests anyway
>   c) cheap drives won't support barriers
>
> Even assuming the drives honored all your requests without lying, how
> would you really want this behavior exposed? From the userland perspective,
> there are very few apps that care. Probably only transactional databases,
> really.
>
> As a DB author, I'm not sure I'd be keen on this as an open() or fcntl()
> option. Databases that really care would be on dedicated filesystems and/or
> devices, so per-file control would be tedious. You would most likely want
> to say "all writes to this string of devices should be order-preserving"
> and forget about it. With that guarantee, a careful writer can have
> perfectly intact data structures all the time, without ever slowing down
> for a fsync.
>
>
SQLite cares.  SQLite is an in-process, transactional, zero-configuration
database that is estimated to be used by over 1 million distinct
applications and to have over 2 billion deployments.  SQLite uses
ordinary disk files in ordinary directories, often selected by the
end-user.  There is no system administrator with SQLite, so there is no
opportunity to use a dedicated filesystem with special mount options.

SQLite uses fsync() as a write barrier to assure consistency following a
power loss.  In addition, we do everything we can to maximize the amount of
time after the fsync() before we actually do another write where order
matters, in the hopes that the writes will still be ordered on platforms
where fsync() is ignored for whatever reason.  Even so, we believe we could
get a significant performance boost and reliability improvement if we had a
reliable write barrier.
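
For readers unfamiliar with the pattern, here is a greatly simplified
outline of what "fsync() as a write barrier" means in a rollback-journal
style commit; it illustrates the general idea only and is not SQLite's
actual code:

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static int commit(int journal_fd, int db_fd, const char *journal_path,
                  const void *old_page, const void *new_page,
                  size_t len, off_t off)
{
        /* 1. Save the original page so a crash can roll back.          */
        if (pwrite(journal_fd, old_page, len, 0) != (ssize_t)len)
                return -1;
        if (fsync(journal_fd))          /* barrier: journal before data */
                return -1;

        /* 2. Overwrite the database page in place.                     */
        if (pwrite(db_fd, new_page, len, off) != (ssize_t)len)
                return -1;
        if (fsync(db_fd))               /* barrier: data before commit  */
                return -1;

        /* 3. Deleting the journal is the commit point.                 */
        return unlink(journal_path);
}

A true light-weight barrier would let those steps stay ordered without
having to wait for each fsync() to complete.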


> --
>   -- Howard Chu
>   CTO, Symas Corp.           http://www.symas.com
>   Director, Highland Sun     http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  http://www.openldap.org/project/
>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users-CzDROfG0BjIdnm+yROfE0A@public.gmane.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
D. Richard Hipp
drh-CzDROfG0BjIdnm+yROfE0A@public.gmane.org

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]                                                 ` <50931601.4060102-aQkYFu9vm6AAvxtiuMwx3w@public.gmane.org>
@ 2012-11-02 12:33                                                   ` Alan Cox
  0 siblings, 0 replies; 154+ messages in thread
From: Alan Cox @ 2012-11-02 12:33 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin,
	Theodore Ts'o, drh, linux-kernel, linux-fsdevel

> Isn't any type of kernel-side ordering an exercise in futility, since
>    a) the kernel has no knowledge of the disk's actual geometry
>    b) most drives will internally re-order requests anyway

They will but only as permitted by the commands queued, so you have some
control depending upon the interface capabilities.

>    c) cheap drives won't support barriers

Barriers are pretty much universal, as you need them for power off!

> Even assuming the drives honored all your requests without lying, how would 
> you really want this behavior exposed? From the userland perspective, there 
> are very few apps that care. Probably only transactional databases, really.

And file systems internally sometimes. A file system is after all a
transactional database of sorts.

Alan

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-10-25 18:29                                                     ` Theodore Ts'o
@ 2012-11-05 20:03                                                       ` Pavel Machek
  -1 siblings, 0 replies; 154+ messages in thread
From: Pavel Machek @ 2012-11-05 20:03 UTC (permalink / raw)
  To: Theodore Ts'o, david, Nico Williams,
	General Discussion of SQLite Database, 杨苏立 Yang Su Li,
	linux-fsdevel, linux-kernel, drh

On Thu 2012-10-25 14:29:48, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 11:03:13AM -0700, david@lang.hm wrote:
> > I agree, this is why I'm trying to figure out the recommended way to
> > do this without needing to do full commits.
> > 
> > Since in most cases it's acceptable to loose the last few chunks
> > written, if we had some way of specifying ordering, without having
> > to specify "write this NOW", the solution would be pretty obvious.
> 
> Well, using data journalling with ext3/4 may do what you want.  If you
> don't do any fsync, the changes will get written every 5 seconds when
> the automatic journal sync happens (and sub-4k writes will also get

Hmm. But that would need setting journalling mode per-file, no?

Like, make it journal data for all the databases, but keep normal mode
for the rest of the system...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-05 22:04                                                         ` Theodore Ts'o
  0 siblings, 0 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-11-05 22:04 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Nico Williams, General Discussion of SQLite Database,
	杨苏立 Yang Su Li, linux-fsdevel, linux-kernel, drh

On Mon, Nov 05, 2012 at 09:03:48PM +0100, Pavel Machek wrote:
> > Well, using data journalling with ext3/4 may do what you want.  If you
> > don't do any fsync, the changes will get written every 5 seconds when
> > the automatic journal sync happens (and sub-4k writes will also get
> 
> Hmm. But that would need setting journalling mode per-file, no?
> 
> Like, make it journal data for all the databases, but keep normal mode
> for rest of system...

You can do that, using "chattr +j file.db".  It's apparently not a
well known feature of ext3/4....
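
For reference, a minimal sketch (not part of the original mail) of the
programmatic equivalent of "chattr +j": the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS
ioctls from <linux/fs.h> toggling the FS_JOURNAL_DATA_FL inode flag. As noted
later in the thread, setting the flag needs CAP_SYS_RESOURCE, so this is
illustrative rather than something an unprivileged application can rely on:

  #include <fcntl.h>
  #include <linux/fs.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("file.db", O_RDONLY);
          if (fd < 0) { perror("open"); return 1; }

          int attr;
          if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0) { perror("FS_IOC_GETFLAGS"); return 1; }

          attr |= FS_JOURNAL_DATA_FL;        /* same bit that "chattr +j" sets */
          if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0) { perror("FS_IOC_SETFLAGS"); return 1; }

          close(fd);
          return 0;
  }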

						- Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: light weight write barriers
       [not found]                                                         ` <20121105220440.GB25378-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
@ 2012-11-05 22:37                                                           ` Richard Hipp
  2012-11-05 23:00                                                               ` Theodore Ts'o
  0 siblings, 1 reply; 154+ messages in thread
From: Richard Hipp @ 2012-11-05 22:37 UTC (permalink / raw)
  To: General Discussion of SQLite Database, Theodore Ts'o,
	Pavel Machek, david-gFPdbfVZQbY, Nico Williams,
	杨苏立 Yang Su Li, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel, drh-X1OJI8nnyKUAvxtiuMwx3w

On Mon, Nov 5, 2012 at 5:04 PM, Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org> wrote:

> On Mon, Nov 05, 2012 at 09:03:48PM +0100, Pavel Machek wrote:
> > > Well, using data journalling with ext3/4 may do what you want.  If you
> > > don't do any fsync, the changes will get written every 5 seconds when
> > > the automatic journal sync happens (and sub-4k writes will also get
> >
> > Hmm. But that would need setting journalling mode per-file, no?
> >
> > Like, make it journal data for all the databases, but keep normal mode
> > for rest of system...
>
> You can do that, using "chattr +j file.db".  It's apparently not a
> well known feature of ext3/4....
>

Per the docs:  "Only the superuser or a process possessing the
CAP_SYS_RESOURCE capability can set or clear this attribute."  That
prevents most applications that run SQLite from being able to take
advantage of this, since most such applications lack elevated privileges.


>
>                                                 - Ted



-- 
D. Richard Hipp
drh-CzDROfG0BjIdnm+yROfE0A@public.gmane.org

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-05 23:00                                                               ` Theodore Ts'o
  0 siblings, 0 replies; 154+ messages in thread
From: Theodore Ts'o @ 2012-11-05 23:00 UTC (permalink / raw)
  To: Richard Hipp
  Cc: General Discussion of SQLite Database, Pavel Machek, david,
	Nico Williams, 杨苏立 Yang Su Li, linux-fsdevel, linux-kernel,
	drh

On Mon, Nov 05, 2012 at 05:37:02PM -0500, Richard Hipp wrote:
> 
> Per the docs:  "Only the superuser or a process possessing the
> CAP_SYS_RESOURCE capability can set or clear this attribute."  That
> prevents most applications that run SQLite from being able to take
> advantage of this, since most such applications lack elevated privileges.

If this feature would prove useful to SQLite, that's something we
could address.  I could imagine making this available to processes that
belong to a specific group that would be specified in the superblock
or as a mount option.  (We already have something like that which
allows a specific uid or gid to use the reserved space in the
superblock.)

							- Ted

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: light weight write barriers
       [not found]                                     ` <508B3EED.2080003-d+Crzxg7Rs0@public.gmane.org>
@ 2012-11-11  4:25                                       ` 杨苏立 Yang Su Li
  2012-11-13  3:42                                           ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 154+ messages in thread
From: 杨苏立 Yang Su Li @ 2012-11-11  4:25 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	General Discussion of SQLite Database, Theodore Ts'o,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Richard Hipp

On Fri, Oct 26, 2012 at 8:54 PM, Vladislav Bolkhovitin <vst-d+Crzxg7Rs0@public.gmane.org> wrote:

>
> Theodore Ts'o, on 10/25/2012 01:14 AM wrote:
>
>> On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
>>
>>> Yes, SCSI has full support for ordered/simple commands designed
>>> exactly for that task: to have steady flow of commands even in case
>>> when some of them are ordered.....
>>>
>>
>> SCSI does, yes --- *if* the device actually implements Tagged Command
>> Queuing (TCQ).  Not all devices do.
>>
>> More importantly, SATA drives do *not* have this capability, and when
>> you compare the price of SATA drives to uber-expensive "enterprise
>> drives", it's not surprising that most people don't actually use
>> SCSI/SAS drives that have implemented TCQ.
>>
>
> What different in our positions is that you are considering storage as
> something you can connect to your desktop, while in my view storage is
> something, which stores data and serves them the best possible way with the
> best performance.
>
> Hence, for you the least common denominator of all storage features is the
> most important, while for me to get the best of what possible from storage
> is the most important.
>
> In my view storage should offload from the host system as much as
> possible: data movements, ordered operations requirements, atomic
> operations, deduplication, snapshots, reliability measures (eg RAIDs), load
> balancing, etc.
>
> It's the same as with 2D/3D video acceleration hardware. If you want the
> best performance from your system, you should offload from it as much as
> possible. In case of video - to the video hardware, in case of storage - to
> the storage. The same as with video, for storage better offload - better
> performance. On hundreds of thousands IOPS it's clearly visible.
>
> Price doesn't matter here, because it's completely different topic.
>
>
>  SATA's Native Command
>> Queuing (NCQ) is not equivalent; this allows the drive to reorder
>> requests (in particular read requests) so they can be serviced more
>> efficiently, but it does *not* allow the OS to specify a partial,
>> relative ordering of requests.
>>
>
> And so? If SATA can't do it, does it mean that nobody else can't do it
> too? I know a plenty of non-SATA devices, which can do the ordering
> requirements you need.
>

I would be very much interested in what kinds of devices support this kind of
"topological order", and in what settings they are typically used.

Does modern flash/SSD (especially the kind used in smartphones) support this?

If you could point me to some information about this, that would be very
much appreciated.

Thanks a lot!

Suli

>
> Vlad
>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-02  0:38                                                 ` Howard Chu
                                                                   ` (2 preceding siblings ...)
  (?)
@ 2012-11-13  3:37                                                 ` Vladislav Bolkhovitin
  -1 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13  3:37 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, Alan Cox,
	Vladislav Bolkhovitin, Theodore Ts'o, drh, linux-kernel,
	linux-fsdevel


Howard Chu, on 11/01/2012 08:38 PM wrote:
> Alan Cox wrote:
>>> How about that recently preliminary infrastructure to send ORDERED commands
>>> instead of queue draining was deleted from the kernel, because "there's no
>>> difference where to drain the queue, on the kernel or the storage side"?
>>
>> Send patches.
>
> Isn't any type of kernel-side ordering an exercise in futility, since
> a) the kernel has no knowledge of the disk's actual geometry
> b) most drives will internally re-order requests anyway
> c) cheap drives won't support barriers

This is why it is so important for performance to use all of the storage's
capabilities, particularly ORDERED commands, instead of trying to pretend to be
smarter than the storage by draining the queue.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-02 12:33                                                   ` Alan Cox
@ 2012-11-13  3:41                                                     ` Vladislav Bolkhovitin
  -1 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13  3:41 UTC (permalink / raw)
  To: Alan Cox
  Cc: Howard Chu, General Discussion of SQLite Database,
	Vladislav Bolkhovitin, Theodore Ts'o, drh, linux-kernel,
	linux-fsdevel


Alan Cox, on 11/02/2012 08:33 AM wrote:
>>     b) most drives will internally re-order requests anyway
>
> They will but only as permitted by the commands queued, so you have some
> control depending upon the interface capabilities.
>
>>     c) cheap drives won't support barriers
>
> Barriers are pretty much universal as you need them for power off !

I'm afraid, no storage (drives, if you like this term more) at the moment supports 
barriers and, as far as I know the storage history, has never supported.

Instead, what storage does support in this area are:

1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.

2. Command ordering facilities: command attributes (ORDERED, SIMPLE, etc.), ACA,
etc.

3. Atomic commands, e.g. scattered writes, which allow writing data to several
separate, non-adjacent blocks in an atomic manner, i.e. they guarantee that either
all blocks are written or none at all. This is relatively new functionality, natural
for flash storage with its COW internals.

Obviously, using such atomic write commands, an application or a file system doesn't
need any journaling anymore. FusionIO reported that after they modified MySQL to
use them, they saw a 50% performance increase.


Note that those 3 facilities are ORTHOGONAL, i.e. they can be used independently,
including on the same request. That is the root cause of why the barrier concept is
so evil. If you specify a barrier, how do you say what actual action you really
want from the storage: a cache flush? Ordered writes? Both?

This is why the relatively recent removal of barriers from the Linux kernel
(http://lwn.net/Articles/400541/) was a big step forward. The next logical step
should be to allow the ORDERED attribute on requests to be accelerated by the
storage's ORDERED commands, if it supports them, and to fall back to the existing
queue draining if not.

Actually, I wonder why the barrier concept is so sticky in the Linux world. A
simple Google search shows that only Linux uses this concept for storage. Two
years have passed since barriers were removed from the kernel, yet people still
discuss them as if they were here.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-02 12:24                                                   ` Richard Hipp
@ 2012-11-13  3:41                                                     ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13  3:41 UTC (permalink / raw)
  To: Richard Hipp
  Cc: General Discussion of SQLite Database, Theodore Ts'o, drh,
	linux-kernel, linux-fsdevel, Alan Cox

Richard Hipp, on 11/02/2012 08:24 AM wrote:
> SQLite cares.  SQLite is an in-process, transaction, zero-configuration
> database that is estimated to be used by over 1 million distinct
> applications and to be have over 2 billion deployments.  SQLite uses
> ordinary disk files in ordinary directories, often selected by the
> end-user.  There is no system administrator with SQLite, so there is no
> opportunity to use a dedicated filesystem with special mount options.
>
> SQLite uses fsync() as a write barrier to assure consistency following a
> power loss.  In addition, we do everything we can to maximize the amount of
> time after the fsync() before we actually do another write where order
> matters, in the hopes that the writes will still be ordered on platforms
> where fsync() is ignored for whatever reason.  Even so, we believe we could
> get a significant performance boost and reliability improvement if we had a
> reliable write barrier.

I would suggest you forget the word "barrier" for productivity's sake. You don't
want barriers and the confusion they bring. What you want instead is access to
storage-accelerated cache sync, command ordering, and atomic attributes/operations.
See my other e-mail from today about those.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-11  4:25                                       ` 杨苏立 Yang Su Li
@ 2012-11-13  3:42                                           ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-13  3:42 UTC (permalink / raw)
  To: 杨苏立 Yang Su Li
  Cc: Theodore Ts'o, General Discussion of SQLite Database,
	linux-kernel, linux-fsdevel, Richard Hipp

杨苏立 Yang Su Li, on 11/10/2012 11:25 PM wrote:
>>  SATA's Native Command
>>> Queuing (NCQ) is not equivalent; this allows the drive to reorder
>>> requests (in particular read requests) so they can be serviced more
>>> efficiently, but it does *not* allow the OS to specify a partial,
>>> relative ordering of requests.
>>>
>>
>> And so? If SATA can't do it, does it mean that nobody else can't do it
>> too? I know a plenty of non-SATA devices, which can do the ordering
>> requirements you need.
>>
>
> I would be very much interested in what kinds of devices support this kind of
> "topological order", and in what settings they are typically used.
>
> Does modern flash/SSD (especially the kind used in smartphones) support this?
>
> If you could point me to some information about this, that would be very
> much appreciated.

I don't think the storage in smartphones can support such advanced functionality,
because it tends to be the cheapest, hence the simplest.

But many modern enterprise SAS drives can do it, because for those customers
performance is the key requirement. Unfortunately, I'm not sure I can name exact
brands and models, because my knowledge comes from NDA'ed docs, so this information
may itself be under NDA.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-13 17:40                                                       ` Alan Cox
  0 siblings, 0 replies; 154+ messages in thread
From: Alan Cox @ 2012-11-13 17:40 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Howard Chu, General Discussion of SQLite Database,
	Theodore Ts'o, drh, linux-kernel, linux-fsdevel

> > Barriers are pretty much universal as you need them for power off !
> 
> I'm afraid, no storage (drives, if you like this term more) at the moment supports 
> barriers and, as far as I know the storage history, has never supported.

The ATA cache flush is a write barrier, and given you have no NV cache
visible to the controller it's the same thing.

> Instead, what storage does support in this area are:

Yes - the devil is in the detail once you go beyond simple capabilities.

Alan

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-13 19:13                                                         ` Nico Williams
  0 siblings, 0 replies; 154+ messages in thread
From: Nico Williams @ 2012-11-13 19:13 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp,
	linux-kernel, linux-fsdevel

On Tue, Nov 13, 2012 at 11:40 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> > Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid, no storage (drives, if you like this term more) at the moment supports
>> barriers and, as far as I know the storage history, has never supported.
>
> The ATA cache flush is a write barrier, and given you have no NV cache
> visible to the controller it's the same thing.
>
>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.

Right: barriers are trivial to program with.  Ordered writes less so.
One could declare all writes to be ordered with respect to each other,
but this will almost certainly hurt performance (at least with disks,
though probably not SSDs) as opposed to barriers, which order one
group of internally-unordered writes relative to another.  And
declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.

There's a lot to be said for simplicity... as long as the system is
not so simple as to not work at all.

My p.o.v. is that a filesystem write barrier is effectively the same
as fsync() with the ability to return sooner (before writes hit stable
storage) when the filesystem and hardware support on-disk layouts and
primitives which can be used to order writes preceding and succeeding
the barrier.

Nico
--

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-13 17:40                                                       ` Alan Cox
@ 2012-11-15  1:16                                                         ` Vladislav Bolkhovitin
  -1 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-15  1:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Howard Chu, General Discussion of SQLite Database,
	Theodore Ts'o, drh, linux-kernel, linux-fsdevel


Alan Cox, on 11/13/2012 12:40 PM wrote:
>>> Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid, no storage (drives, if you like this term more) at the moment supports
>> barriers and, as far as I know the storage history, has never supported.
>
> The ATA cache flush is a write barrier, and given you have no NV cache
> visible to the controller it's the same thing.

A cache flush is a cache flush. You can call it a barrier if you want to keep
confusing yourself and others.

>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.

None of those details brings anything unsolvable. For instance, I already
described in this thread a simple way the requested order of commands can be
carried through the stack, and I implemented that algorithm in SCST.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-13 19:13                                                         ` Nico Williams
  (?)
@ 2012-11-15  1:17                                                         ` Vladislav Bolkhovitin
  2012-11-15 12:07                                                             ` David Lang
  2012-11-15 17:06                                                             ` Ryan Johnson
  -1 siblings, 2 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-15  1:17 UTC (permalink / raw)
  To: Nico Williams
  Cc: General Discussion of SQLite Database, Theodore Ts'o,
	Richard Hipp, linux-kernel, linux-fsdevel


Nico Williams, on 11/13/2012 02:13 PM wrote:
> declaring groups of internally-unordered writes where the groups are
> ordered with respect to each other... is practically the same as
> barriers.

Which barriers? Barriers meaning cache flush or barriers meaning commands order, 
or barriers meaning both?

There's no such thing as "barrier". It is fully artificial abstraction. After all, 
at the bottom of your stack, you will have to translate it either to cache flush, 
or commands order enforcement, or both.

Are you going to invent 3 types of barriers?

> There's a lot to be said for simplicity... as long as the system is
> not so simple as to not work at all.
>
> My p.o.v. is that a filesystem write barrier is effectively the same
> as fsync() with the ability to return sooner (before writes hit stable
> storage) when the filesystem and hardware support on-disk layouts and
> primitives which can be used to order writes preceding and succeeding
> the barrier.

Your mistake is that you are considering barriers as something real, which can do 
something real for you, while it is just a artificial abstraction apparently 
invented by people with limited knowledge how storage works, hence having very 
foggy vision how barriers supposed to be processed by it. A simple wrong answer.

Generally, you can invent any abstraction convenient for you, but farther your 
abstractions from reality of your hardware => less you will get from it with 
bigger effort.

There are no barriers in Linux and not going to be. Accept it. And start instead 
thinking about offload capabilities your storage can offer to you.

Vlad


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-15 12:07                                                             ` David Lang
  0 siblings, 0 replies; 154+ messages in thread
From: David Lang @ 2012-11-15 12:07 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Nico Williams, General Discussion of SQLite Database,
	Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:

> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are
>> ordered with respect to each other... is practically the same as
>> barriers.
>
> Which barriers? Barriers meaning cache flush or barriers meaning commands 
> order, or barriers meaning both?
>
> There's no such thing as "barrier". It is fully artificial abstraction. After 
> all, at the bottom of your stack, you will have to translate it either to 
> cache flush, or commands order enforcement, or both.

When people talk about barriers, they are talking about order enforcement.

> Your mistake is that you are considering barriers as something real, which 
> can do something real for you, while it is just a artificial abstraction 
> apparently invented by people with limited knowledge how storage works, hence 
> having very foggy vision how barriers supposed to be processed by it. A 
> simple wrong answer.
>
> Generally, you can invent any abstraction convenient for you, but farther 
> your abstractions from reality of your hardware => less you will get from it 
> with bigger effort.
>
> There are no barriers in Linux and not going to be. Accept it. And start 
> instead thinking about offload capabilities your storage can offer to you.

the hardware capabilities are not directly accessible from userspace (and they
probably shouldn't be)

barriers keep getting mentioned because they are an easy concept to understand.
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of this gets done" and they fit well with the requirements of the
users.

Users readily accept that if the system crashes, they will lose the most recent
stuff that they did, but they get annoyed when things get corrupted to the point
that they lose the entire file.

This includes things like modifying one option and a crash resulting in the
config file being blank. Yes, you can do the "write to temp file, sync file,
sync directory, rename file" dance, but the fact that to do so the user must sit
and wait for the syncs to take place can be a problem. It would be far better to
be able to say "write to temp file, and after it's on disk, rename the file" and
not have the user wait. The user doesn't really care if the changes hit disk
immediately, or several seconds (or even 10s of seconds) later, as long as there
is not any possibility of the rename hitting disk before the file contents.
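
For concreteness, a short sketch (an illustration, not part of the original
mail) of one common form of that dance: fsync the temp file, rename it over the
target, then fsync the containing directory so the rename itself is durable.
The helper name and paths are made up:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Replace 'final' with new contents, waiting on every sync along the way;
   * those waits are exactly the latency complained about above. */
  static int replace_file(const char *dir, const char *tmp, const char *final,
                          const char *data, size_t len)
  {
          int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (fd < 0) return -1;

          /* contents must be on disk before the rename */
          if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
                  close(fd);
                  return -1;
          }
          close(fd);

          if (rename(tmp, final) < 0)        /* atomic replace of the old file */
                  return -1;

          int dfd = open(dir, O_RDONLY);     /* sync the directory so the rename sticks */
          if (dfd < 0) return -1;
          int rc = fsync(dfd);
          close(dfd);
          return rc;
  }

A call such as replace_file("/etc/app", "/etc/app/config.tmp", "/etc/app/config",
buf, n) blocks in both fsync() calls; the request above is to be able to drop
those waits while keeping the ordering.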

The fact that this could be implemented in multiple ways in the existing 
hardware does not mean that there need to be multiple ways exposed to userspace, 
it just means that the cost of doing the operation will vary depending on the 
hardware that you have. This also means that if new hardware introduces a new 
way of implementing this, that improvement can be passed on to the users without 
needing application changes.

David Lang

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: light weight write barriers
       [not found]                                                             ` <alpine.DEB.2.02.1211150353080.32408-UEhY+ZBZOcqqLGM74eQ/YA@public.gmane.org>
@ 2012-11-15 16:14                                                               ` 杨苏立 Yang Su Li
  2012-11-17  5:02                                                                   ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 154+ messages in thread
From: 杨苏立 Yang Su Li @ 2012-11-15 16:14 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Theodore Ts'o,
	Vladislav Bolkhovitin, linux-kernel, Richard Hipp

On Thu, Nov 15, 2012 at 6:07 AM, David Lang <david-gFPdbfVZQbY@public.gmane.org> wrote:

> On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:
>
>  Nico Williams, on 11/13/2012 02:13 PM wrote:
>>
>>> declaring groups of internally-unordered writes where the groups are
>>> ordered with respect to each other... is practically the same as
>>> barriers.
>>>
>>
>> Which barriers? Barriers meaning cache flush or barriers meaning commands
>> order, or barriers meaning both?
>>
>> There's no such thing as "barrier". It is fully artificial abstraction.
>> After all, at the bottom of your stack, you will have to translate it
>> either to cache flush, or commands order enforcement, or both.
>>
>
> When people talk about barriers, they are talking about order enforcement.
>
>
>  Your mistake is that you are considering barriers as something real,
>> which can do something real for you, while it is just a artificial
>> abstraction apparently invented by people with limited knowledge how
>> storage works, hence having very foggy vision how barriers supposed to be
>> processed by it. A simple wrong answer.
>>
>> Generally, you can invent any abstraction convenient for you, but farther
>> your abstractions from reality of your hardware => less you will get from
>> it with bigger effort.
>>
>> There are no barriers in Linux and not going to be. Accept it. And start
>> instead thinking about offload capabilities your storage can offer to you.
>>
>
> the hardware capabilities are not directly accessible from userspace (and
> they probably shouldn't be)
>
> barriers keep getting mentioned because they are an easy concept to
> understand. "do this set of stuff before doing any of this other set of
> stuff, but I don't care when any of this gets done" and they fit well with
> the requirements of the users.
>

Well, I think there are two questions to be answered here: what primitive
should be offered to the user by the file system (currently we have fsync);
and what primitive should be offered by the lower level and used by the
file system (currently we have barrier, or flushing and FUA).

I do agree that we should keep what is accessible from user-space simple
and stupid. However, if you look at fsync semantics a bit more closely, I think
there are two things to be noted:

1. fsync actually does two things at the same time: ordering writes (in a
barrier-like manner), and forcing cached writes to disk. This makes it very
difficult to implement fsync efficiently. However, logically they are two
distinct functionalities, and a user might want one but not the other.
In particular, users might want a barrier but not care (much) about durability.
I have no idea why ordering and durability, which seem quite different, ended
up being bundled together in a single fsync call.

2. The fsync semantics in POSIX are a bit vague (at least to me) in a concurrent
setting. What is the expected behavior when more than one thread writes to
the same file descriptor, or to different file descriptors associated with the
same file?

So I do think in the user space, we need some kind of barrier (or other)
primitive which is not tied to durability guarantees; and hopefully this
primitive could be implemented more efficiently than fsync. And of course,
this primitive should be simple and intuitive, abstracting the complexity
out.
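
As a toy illustration of point 1 (mine, not part of the original mail): in the
common log-then-update pattern, the fsync() in the middle is wanted only for
ordering, yet the caller also pays for the durability it forces:

  #include <fcntl.h>
  #include <sys/types.h>
  #include <unistd.h>

  int logged_update(int log_fd, int data_fd, off_t off,
                    const void *rec, size_t rec_len,
                    const void *val, size_t val_len)
  {
          if (write(log_fd, rec, rec_len) != (ssize_t)rec_len)
                  return -1;
          if (fsync(log_fd) < 0)   /* wanted: "log record reaches disk before the update" */
                  return -1;       /* also paid for: "log record is durable right now"    */
          if (pwrite(data_fd, val, val_len, off) != (ssize_t)val_len)
                  return -1;
          return 0;
  }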


On the other hand, there is the question of what the file system should use.
Traditionally the block layer provided a barrier primitive, and now I think it
is moving to flushing and FUA, or even ordered commands
(http://lwn.net/Articles/400541/).

As for whether the file system should be exposed to the hardware capability, in
this case ordered commands, I personally think it should. In modern file systems
we do all kinds of things to ensure ordering, and I can see how leveraging
ordered commands (when available from the hardware) could potentially boost
performance. And all the complexity of, say, topological ordering, is dealt with
within the file system and is not visible to the user.

Of course, there are challenges when you want to do ordered writes in a file
system. As Ts'o mentioned, when you have entangled metadata updates, i.e., you
update file A and file B, and A and B might share metadata, it can be difficult
to get the ordering right without sacrificing performance. But I personally
think it is worth exploring.

Suli


>
> Users readily accept that if the system crashes, they will lose the most
> recent stuff that they did, but they get annoyed when things get corrupted
> to the point that they lose the entire file.
>
> This includes things like modifying one option and a crash resulting in
> the config file being blank. Yes, you can do the "write to temp file, sync
> file, sync directory, rename file" dance, but the fact that to do so the
> user must sit and wait for the syncs to take place can be a problem. It
> would be far better to be able to say "write to temp file, and after it's
> on disk, rename the file" and not have the user wait. The user doesn't
> really care if the changes hit disk immediately, or several seconds (or
> even 10s of seconds) later, as long as there is not any possibility of the
> rename hitting disk before the file contents.
>
> The fact that this could be implemented in multiple ways in the existing
> hardware does not mean that there need to be multiple ways exposed to
> userspace, it just means that the cost of doing the operation will vary
> depending on the hardware that you have. This also means that if new
> hardware introduces a new way of implementing this, that improvement can be
> passed on to the users without needing application changes.
>
> David Lang
>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-15 17:06                                                             ` Ryan Johnson
  0 siblings, 0 replies; 154+ messages in thread
From: Ryan Johnson @ 2012-11-15 17:06 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: Vladislav Bolkhovitin, Nico Williams, linux-fsdevel,
	Theodore Ts'o, linux-kernel, Richard Hipp

On 14/11/2012 8:17 PM, Vladislav Bolkhovitin wrote:
> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are
>> ordered with respect to each other... is practically the same as
>> barriers.
>
> Which barriers? Barriers meaning cache flush or barriers meaning 
> commands order, or barriers meaning both?
>
> There's no such thing as "barrier". It is fully artificial 
> abstraction. After all, at the bottom of your stack, you will have to 
> translate it either to cache flush, or commands order enforcement, or 
> both.
Isn't that  why we *have* "the stack" in the first place? So apps 
*don't* have to worry about how the OS implements an artificial (= 
high-level and portable) abstraction on a given device?

>
> Are you going to invent 3 types of barriers?
One will do, it just needs to be a good one.

Maybe I'm missing something here, so I'm going to back up a bit and 
recap what I understand.

The filesystem abstracts the concept of encoding patterns of bits in 
some physical media (data), and making it easy to find and retrieve 
those bits later (metadata, incl. file name). When users read(), they 
expect to see whatever they most recently sent to write(). They also 
expect that what they write will still be there later,  in spite of any 
failure that leaves the disk itself intact.

Operating systems cheat by not actually writing to disk -- for 
performance reasons -- and users are (mostly, usually) OK with that, 
because the performance gains are so attractive and things usually work 
out anyway. Disks cheat too, in the same way and for the same reason.

The cheating works great most of the time, but breaks down -- badly -- 
if we actually care about what is on disk after a crash (or if we use a 
network filesystem). Enough people do care that fsync() was added to the 
toolbox. It is defined to transfer "all modified in-core data of the 
file referred to by the file descriptor fd to the disk device" and 
"blocks until the device reports that the transfer has completed" 
(quoting from the fsync(2) man page). Translation: "Stop cheating. Make 
sure the stuff I already wrote actually got written. And tell the disk 
to stop cheating, too."

Problem is, this definition is asymmetric: it says what happens to 
writes issued before the fsync, but nothing about those issued after the 
fsync starts and before it returns [1]. The reader has to assume 
fsync() makes no promises whatsoever about these later writes: making 
fsync capture them exposes callers of fsync() to DoS attacks, and 
preventing them from reaching disk until all outstanding fsync calls 
complete would add complexity the spec doesn't currently demand, leading 
to understandable reluctance by kernel devs to code it up. Unfortunately, 
we're left with 
the filesystem equivalent of what we in the database world call 
"eventual consistency" -- easy to implement, nice and fast, but very 
difficult to write reliable code against unless you're willing to pay 
the cost of being fully synchronous, all the time. Having tried that for 
a few years, many people are "returning" to better-specified concurrency 
models, trading some amount of performance for comfort that the app will 
at least work predictably when things go wrong in strange and 
unanticipated ways.

The request, then, is to tighten up fsync semantics in two conceptually 
straightforward ways [2]: First, guarantee that later writes to an fd do 
not hit disk until earlier calls to fsync() complete. Second, make the 
call asynchronous. That's all.

Note that both changes are necessary. The improved ordering semantic is 
useless by itself, because it's still not safe to request a blocking 
fsync from one thread and then let other threads continue issuing 
writes: there's a race between broadcasting that fsync has begun and 
issuing the actual syscall that begins it. An asynchronous fsync is also 
useless by itself, because it only benefits uncoordinated writes (which 
evidently don't care what data actually reaches disk anyway).
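
For reference, the closest existing userspace primitive to the asynchronous
half is POSIX aio_fsync(); a minimal sketch (not part of the original mail)
showing that it returns right away but does nothing to hold back later writes:

  #include <aio.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Build with -lrt on glibc. */
  int main(void)
  {
          int fd = open("data.bin", O_WRONLY | O_CREAT | O_APPEND, 0644);
          if (fd < 0) { perror("open"); return 1; }
          write(fd, "earlier write\n", 14);

          struct aiocb cb;
          memset(&cb, 0, sizeof cb);
          cb.aio_fildes = fd;
          if (aio_fsync(O_SYNC, &cb) < 0) { perror("aio_fsync"); return 1; }

          /* The flush proceeds in the background; nothing stops this later write
           * from reaching disk first, which is the gap the proposal closes. */
          write(fd, "later write\n", 12);

          while (aio_error(&cb) == EINPROGRESS)
                  usleep(1000);

          close(fd);
          return 0;
  }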

The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to 
the affected file, wait for the device to report success, issue a cache 
flush to the device (or request ordering commands, if available) to make 
it tell the truth, and wait for the device to report success. AFAIK this 
already happens, but without taking advantage of any request ordering 
commands.
2. The requesting thread returns as soon as the kernel has identified 
all data that will be written back. This is new, but pretty similar to 
what AIO already does.
3. No write is allowed to enqueue any requests at the device that 
involve the same file, until all outstanding fsync complete [3]. This is 
new.

The performance hit for #1 can be reduced significantly if the storage 
hardware at hand happens to support some form of request ordering. The 
amount of reduction could vary greatly depending on how sophisticated 
such request ordering is, and how much effort the kernel and/or device 
driver are willing to work for it. In any case, fsync should already do 
this [4].

The performance hit for #3 can be minimized by buffering small or 
otherwise convenient writes in the fs cache and letting the call return 
immediately, as usual. The corresponding pages just have to be marked in 
some way to prevent them from being written back too soon. Sequence 
numbers work well for this sort of thing. Big requests may have to 
block, but they probably would have anyway, if the buffer cache couldn't 
absorb them. As with #1, fancy command ordering capabilities in the 
underlying device just allow additional performance optimizations.

A carefully-written app (e.g. free of I/O races) would do pretty well 
with this extended fsync, certainly far better than the current state of 
the art allows.

Note that this still offers no protection for reads: no matter how many 
times a thread issues fsync(), it still risks reading non-durable data 
because reads are not ordered wrt either writes or fsync. That's not the 
problem we're trying to solve, though.

Please feel free to point out where I've gone wrong, but this just 
doesn't look like as complex or crazy an idea as you make it out to be.

[1] Maybe POSIX.1-2001 is more specific, but it's not publicly available 
that I could see.

[2] I'm fully aware that implementing the request might require 
significant -- perhaps even unreasonably complex -- changes to the way 
the kernel currently does things (though I do doubt it). That's not a 
good excuse to claim the idea itself is unreasonably complex or 
ill-specified. Just say that it's not a good fit for the current code base.

[3]  Another concern is whether fsync calls operate on the file or a 
particular fd. What if a process opens the same file multiple times, or 
multiple processes have fds pointing to the same file (whether by open 
or fork)? I would argue for file-level barriers, because it leads to a 
vastly simpler design (the fs cache doesn't track which process wrote 
what via what fd). Besides, no app that cares about what ends up on disk 
will allow uncoordinated writes anyway, so why do extra work just to 
ensure I/O races stay fast?

[4] Really, device support for request ordering commands is a bit of a 
red herring: the only way it helps significantly is if (a) the storage 
device has a massive cache compared to the fs cache, (b) it allows I/O 
scheduling to reduce latency of reads and/or writes (which fsync should 
do already, and which matters little for flash), and (c) a logging 
filesystem is not being used (else it's all sequential writes anyway). 
In other words, it can help performance a bit but has little other 
impact on what is essentially a software matter.

>
>> There's a lot to be said for simplicity... as long as the system is
>> not so simple as to not work at all.
>>
>> My p.o.v. is that a filesystem write barrier is effectively the same
>> as fsync() with the ability to return sooner (before writes hit stable
>> storage) when the filesystem and hardware support on-disk layouts and
>> primitives which can be used to order writes preceding and succeeding
>> the barrier.
>
> Your mistake is that you are considering barriers as something real, 
> which can do something real for you, while it is just a artificial 
> abstraction apparently invented by people with limited knowledge how 
> storage works, hence having very foggy vision how barriers supposed to 
> be processed by it. A simple wrong answer.
Storage: Accepts writes and ostensibly makes them available via reads 
even after power failures. Reorders requests nearly arbitrarily and lies 
about whether writes actually took effect, unless you issue appropriate 
cache flushing and/or request ordering commands (and sometimes even 
then, if it was a cheap consumer drive).

OS: Accepts writes and ostensibly makes them available via reads even 
after power failures, reboots, etc. Reorders requests nearly arbitrarily 
and lies about whether writes actually took effect, unless you issue a 
stop-the-world, one-sided write barrier lovingly known as fsync 
(assuming the disk actually listens when you tell it to stop cheating).

Wish: a two-sided write barrier that not only ensures previously-issued 
writes complete before it reports success, but also prevents 
later-issued writes from completing while it is in progress, giving a 
reasonably simple way to enforce some ordering of writes in the system. 
Can be implemented entirely in software, as the OS has full control 
over which requests it chooses to schedule at the device, and also 
decides whether to block the requesting thread or not. Can be made 
virtually as fast as current writes, by maintaining a little extra 
information in the fs cache.
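
For concreteness, here is a purely illustrative sketch of how an application 
might use such a barrier. The fbarrier() call is hypothetical -- no such 
syscall exists today -- and is assumed to order every write to the fd issued 
before the call ahead of every write issued after it, returning without 
waiting for any I/O:

    #include <sys/types.h>
    #include <unistd.h>

    /* HYPOTHETICAL: orders earlier writes to fd ahead of later ones,
     * without blocking until anything reaches stable storage. */
    extern int fbarrier(int fd);

    static int update_record(int fd, const void *log, size_t loglen,
                             const void *rec, size_t reclen, off_t off)
    {
        if (pwrite(fd, log, loglen, 0) < 0)      /* write intent log  */
            return -1;
        if (fbarrier(fd) < 0)                    /* order, don't wait */
            return -1;
        if (pwrite(fd, rec, reclen, off) < 0)    /* update in place   */
            return -1;
        return 0;                                /* no blocking fsync */
    }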

Please, enlighten me: in what way does my limited knowledge of storage, 
or my foggy vision of what is desired, make this feature impossible to 
implement or useless if implemented?

>
> Generally, you can invent any abstraction convenient for you, but 
> farther your abstractions from reality of your hardware => less you 
> will get from it with bigger effort.
>
> There are no barriers in Linux and not going to be. Accept it. And 
> start instead thinking about offload capabilities your storage can 
> offer to you.
Apologies if this comes off as flame-bait, but I'm starting to wonder whose 
abstraction is broken here...

What I understand the above to mean is: "Linux file system abstractions 
are too far from the reality of storage hardware, so it takes lots of 
effort to accomplish little [in the way of enforcing write ordering]. 
Accept it. And start thinking instead about talking directly to a 
storage controller that offers proper write barriers."

I hope I misread what you said, because that's a depressing thing to 
hear from your OS.

Ryan


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15 17:06                                                             ` Ryan Johnson
  (?)
@ 2012-11-15 22:35                                                             ` Chris Friesen
  2012-11-17  5:02                                                                 ` Vladislav Bolkhovitin
  -1 siblings, 1 reply; 154+ messages in thread
From: Chris Friesen @ 2012-11-15 22:35 UTC (permalink / raw)
  To: Ryan Johnson
  Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin,
	Nico Williams, linux-fsdevel, Theodore Ts'o, linux-kernel,
	Richard Hipp

On 11/15/2012 11:06 AM, Ryan Johnson wrote:

> The easiest way to implement this fsync would involve three things:
> 1. Schedule writes for all dirty pages in the fs cache that belong to
> the affected file, wait for the device to report success, issue a cache
> flush to the device (or request ordering commands, if available) to make
> it tell the truth, and wait for the device to report success. AFAIK this
> already happens, but without taking advantage of any request ordering
> commands.
> 2. The requesting thread returns as soon as the kernel has identified
> all data that will be written back. This is new, but pretty similar to
> what AIO already does.
> 3. No write is allowed to enqueue any requests at the device that
> involve the same file, until all outstanding fsync complete [3]. This is
> new.

This sounds interesting as a way to expose some useful semantics to 
userspace.

I assume we'd need to come up with a new syscall or something, since it 
doesn't match the behaviour of POSIX fsync().

Chris

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
       [not found]                                                             ` <alpine.DEB.2.02.1211150353080.32408-UEhY+ZBZOcqqLGM74eQ/YA@public.gmane.org>
@ 2012-11-16 15:06                                                               ` Howard Chu
  0 siblings, 0 replies; 154+ messages in thread
From: Howard Chu @ 2012-11-16 15:06 UTC (permalink / raw)
  To: General Discussion of SQLite Database
  Cc: David Lang, Vladislav Bolkhovitin, Theodore Ts'o,
	Richard Hipp, linux-kernel, linux-fsdevel

David Lang wrote:
> barriers keep getting mentioned because they are a easy concept to understand.
> "do this set of stuff before doing any of this other set of stuff, but I don't
> care when any of this gets done" and they fit well with the requirements of the
> users.
>
> Users readily accept that if the system crashes, they will loose the most recent
> stuff that they did,

*some* users may accept that. *None* should.

> but they get annoyed when things get corrupted to the point
> that they loose the entire file.
>
> this includes things like modifying one option and a crash resulting in the
> config file being blank. Yes, you can do the 'write to temp file, sync file,
> sync directory, rename file" dance, but the fact that to do so the user must sit
> and wait for the syncs to take place can be a problem. It would be far better to
> be able to say "write to temp file, and after it's on disk, rename the file" and
> not have the user wait. The user doesn't really care if the changes hit disk
> immediately, or several seconds (or even 10s of seconds) later, as long as there
> is not any possibility of the rename hitting disk before the file contents.
>
> The fact that this could be implemented in multiple ways in the existing
> hardware does not mean that there need to be multiple ways exposed to userspace,
> it just means that the cost of doing the operation will vary depending on the
> hardware that you have. This also means that if new hardware introduces a new
> way of implementing this, that improvement can be passed on to the users without
> needing application changes.

There are a couple of industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it 
because they don't know better. We programmers, who know better, have failed 
to raise a stink and demand that this be fixed.
   A) Drives should not lose data on power failure. If a drive accepts a write 
request and says "OK, done" then that data should get written to stable 
storage, period. Whether it requires capacitors or some other onboard power 
supply, or whatever, they should just do it. Keep in mind that today, most of 
the difference between enterprise drives and consumer desktop drives is just a 
firmware change; the hardware is already identical. Nobody should accept a 
product that doesn't offer this guarantee. It's inexcusable.
   B) it should go without saying - drives should reliably report back to the 
host when something goes wrong. E.g., if a write request has been accepted, 
cached, and reported complete, but then during the actual write an ECC failure 
is detected in the cacheline, the drive needs to tell the host "oh by the way, 
block XXX didn't actually make it to disk like I told you it did 10ms ago."

If the entire software industry were to simply state "your shit stinks and 
we're not going to take it any more" the hard drive industry would have no 
choice but to fix it. And in most cases it would be a zero-cost fix for them.

Once you have drives that are actually trustworthy, actually reliable (which 
doesn't mean they never fail, only that they tell the truth about 
successes or failures), most of these other issues disappear. Most of the need 
for barriers disappears.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-16 15:06                                                               ` Howard Chu
  (?)
@ 2012-11-16 15:31                                                               ` Ric Wheeler
  2012-11-16 15:54                                                                   ` Howard Chu
  -1 siblings, 1 reply; 154+ messages in thread
From: Ric Wheeler @ 2012-11-16 15:31 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, David Lang,
	Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp,
	linux-kernel, linux-fsdevel

On 11/16/2012 10:06 AM, Howard Chu wrote:
> David Lang wrote:
>> barriers keep getting mentioned because they are a easy concept to understand.
>> "do this set of stuff before doing any of this other set of stuff, but I don't
>> care when any of this gets done" and they fit well with the requirements of the
>> users.
>>
>> Users readily accept that if the system crashes, they will loose the most recent
>> stuff that they did,
>
> *some* users may accept that. *None* should.
>
>> but they get annoyed when things get corrupted to the point
>> that they loose the entire file.
>>
>> this includes things like modifying one option and a crash resulting in the
>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>> sync directory, rename file" dance, but the fact that to do so the user must sit
>> and wait for the syncs to take place can be a problem. It would be far better to
>> be able to say "write to temp file, and after it's on disk, rename the file" and
>> not have the user wait. The user doesn't really care if the changes hit disk
>> immediately, or several seconds (or even 10s of seconds) later, as long as there
>> is not any possibility of the rename hitting disk before the file contents.
>>
>> The fact that this could be implemented in multiple ways in the existing
>> hardware does not mean that there need to be multiple ways exposed to userspace,
>> it just means that the cost of doing the operation will vary depending on the
>> hardware that you have. This also means that if new hardware introduces a new
>> way of implementing this, that improvement can be passed on to the users without
>> needing application changes.
>
> There are a couple industry failures here:
>
> 1) the drive manufacturers sell drives that lie, and consumers accept it 
> because they don't know better. We programmers, who know better, have failed 
> to raise a stink and demand that this be fixed.
>   A) Drives should not lose data on power failure. If a drive accepts a write 
> request and says "OK, done" then that data should get written to stable 
> storage, period. Whether it requires capacitors or some other onboard power 
> supply, or whatever, they should just do it. Keep in mind that today, most of 
> the difference between enterprise drives and consumer desktop drives is just a 
> firmware change, that hardware is already identical. Nobody should accept a 
> product that doesn't offer this guarantee. It's inexcusable.
>   B) it should go without saying - drives should reliably report back to the 
> host, when something goes wrong. E.g., if a write request has been accepted, 
> cached, and reported complete, but then during the actual write an ECC failure 
> is detected in the cacheline, the drive needs to tell the host "oh by the way, 
> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>
> If the entire software industry were to simply state "your shit stinks and 
> we're not going to take it any more" the hard drive industry would have no 
> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>
> Once you have drives that are actually trustworthy, actually reliable (which 
> doesn't mean they never fail, it only means they tell the truth about 
> successes or failures), most of these other issues disappear. Most of the need 
> for barriers disappear.
>

I think that you are arguing a fairly silly point.

If you want that behaviour, you have had it for more than a decade - simply 
disable the write cache on your drive and you are done.

If you - as a user - want to run faster and use applications that are coded to 
handle data integrity properly (fsync, fdatasync, etc), leave the write cache 
enabled and use file system barriers.

Everyone has to trade off cost versus something else and this is a very, very 
long-standing trade-off that drive manufacturers have made.

The more money you pay for your storage, the less likely this is to be an issue 
(high-end SSDs, enterprise-class arrays, etc. don't have volatile write caches, 
and most SAS drives perform reasonably well with the write cache disabled).

Regards,

Ric



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-16 15:54                                                                   ` Howard Chu
  0 siblings, 0 replies; 154+ messages in thread
From: Howard Chu @ 2012-11-16 15:54 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: General Discussion of SQLite Database, David Lang,
	Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp,
	linux-kernel, linux-fsdevel

Ric Wheeler wrote:
> On 11/16/2012 10:06 AM, Howard Chu wrote:
>> David Lang wrote:
>>> barriers keep getting mentioned because they are a easy concept to understand.
>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>> care when any of this gets done" and they fit well with the requirements of the
>>> users.
>>>
>>> Users readily accept that if the system crashes, they will loose the most recent
>>> stuff that they did,
>>
>> *some* users may accept that. *None* should.
>>
>>> but they get annoyed when things get corrupted to the point
>>> that they loose the entire file.
>>>
>>> this includes things like modifying one option and a crash resulting in the
>>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>>> sync directory, rename file" dance, but the fact that to do so the user must sit
>>> and wait for the syncs to take place can be a problem. It would be far better to
>>> be able to say "write to temp file, and after it's on disk, rename the file" and
>>> not have the user wait. The user doesn't really care if the changes hit disk
>>> immediately, or several seconds (or even 10s of seconds) later, as long as there
>>> is not any possibility of the rename hitting disk before the file contents.
>>>
>>> The fact that this could be implemented in multiple ways in the existing
>>> hardware does not mean that there need to be multiple ways exposed to userspace,
>>> it just means that the cost of doing the operation will vary depending on the
>>> hardware that you have. This also means that if new hardware introduces a new
>>> way of implementing this, that improvement can be passed on to the users without
>>> needing application changes.
>>
>> There are a couple industry failures here:
>>
>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>> because they don't know better. We programmers, who know better, have failed
>> to raise a stink and demand that this be fixed.
>>    A) Drives should not lose data on power failure. If a drive accepts a write
>> request and says "OK, done" then that data should get written to stable
>> storage, period. Whether it requires capacitors or some other onboard power
>> supply, or whatever, they should just do it. Keep in mind that today, most of
>> the difference between enterprise drives and consumer desktop drives is just a
>> firmware change, that hardware is already identical. Nobody should accept a
>> product that doesn't offer this guarantee. It's inexcusable.
>>    B) it should go without saying - drives should reliably report back to the
>> host, when something goes wrong. E.g., if a write request has been accepted,
>> cached, and reported complete, but then during the actual write an ECC failure
>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>
>> If the entire software industry were to simply state "your shit stinks and
>> we're not going to take it any more" the hard drive industry would have no
>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>
>> Once you have drives that are actually trustworthy, actually reliable (which
>> doesn't mean they never fail, it only means they tell the truth about
>> successes or failures), most of these other issues disappear. Most of the need
>> for barriers disappear.
>>
>
> I think that you are arguing a fairly silly point.

Seems to me that you're arguing that we should accept inferior technology. 
Who's really being silly?

> If you want that behaviour, you have had it for more than a decade - simply
> disable the write cache on your drive and you are done.

You seem to believe it's nonsensical for someone to want both fast and 
reliable writes, or that it's unreasonable for a storage device to offer the 
same, cheaply. And yet it is clearly trivial to provide all of the above.

> If you - as a user - want to run faster and use applications that are coded to
> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
> enabled and use file system barriers.

Applications aren't supposed to need to worry about such details; that's why 
we have operating systems.

Drives should tell the truth. In the event of an error detected after the fact, 
the drive should report the error back to the host. There's nothing 
nonsensical there.

When a drive's cache is enabled, the host should maintain a queue of written 
pages, of a length equal to the size of the drive's cache. If a drive says 
"hey, block XXX failed" the OS can reissue the write from its own queue. No 
muss, no fuss, no performance bottlenecks. This is what Real Computers did 
before the age of VAX Unix.
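
A rough sketch of the bookkeeping I mean -- this is not how the Linux block 
layer works today, and the sizes are made up, but it shows the idea: keep a 
copy of everything handed to the drive, sized to cover the drive's volatile 
cache, so a late error report can simply be replayed:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE   4096
    #define CACHE_PAGES 8192        /* assume a 32 MB drive cache */

    struct pending_write {
        uint64_t lba;               /* where it was written     */
        uint8_t  data[PAGE_SIZE];   /* copy kept for replay     */
        int      valid;
    };

    static struct pending_write pending[CACHE_PAGES];
    static unsigned int next_slot;  /* oldest slot gets reused  */

    static void remember_write(uint64_t lba, const void *buf)
    {
        struct pending_write *p = &pending[next_slot++ % CACHE_PAGES];
        p->lba = lba;
        memcpy(p->data, buf, PAGE_SIZE);
        p->valid = 1;
    }

    /* called if the drive later says "block XXX didn't make it" */
    static const void *lookup_for_reissue(uint64_t lba)
    {
        for (unsigned int i = 0; i < CACHE_PAGES; i++)
            if (pending[i].valid && pending[i].lba == lba)
                return pending[i].data;
        return NULL;    /* already evicted from the window */
    }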

> Everyone has to trade off cost versus something else and this is a very, very
> long standing trade off that drive manufacturers have made.

With the cost of storage falling as rapidly as it has in recent years, this is 
a stupid tradeoff.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-16 15:54                                                                   ` Howard Chu
@ 2012-11-16 18:03                                                                     ` Ric Wheeler
  -1 siblings, 0 replies; 154+ messages in thread
From: Ric Wheeler @ 2012-11-16 18:03 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, David Lang,
	Vladislav Bolkhovitin, Theodore Ts'o, Richard Hipp,
	linux-kernel, linux-fsdevel

On 11/16/2012 10:54 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 11/16/2012 10:06 AM, Howard Chu wrote:
>>> David Lang wrote:
>>>> barriers keep getting mentioned because they are a easy concept to understand.
>>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>>> care when any of this gets done" and they fit well with the requirements of 
>>>> the
>>>> users.
>>>>
>>>> Users readily accept that if the system crashes, they will loose the most 
>>>> recent
>>>> stuff that they did,
>>>
>>> *some* users may accept that. *None* should.
>>>
>>>> but they get annoyed when things get corrupted to the point
>>>> that they loose the entire file.
>>>>
>>>> this includes things like modifying one option and a crash resulting in the
>>>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>>>> sync directory, rename file" dance, but the fact that to do so the user 
>>>> must sit
>>>> and wait for the syncs to take place can be a problem. It would be far 
>>>> better to
>>>> be able to say "write to temp file, and after it's on disk, rename the 
>>>> file" and
>>>> not have the user wait. The user doesn't really care if the changes hit disk
>>>> immediately, or several seconds (or even 10s of seconds) later, as long as 
>>>> there
>>>> is not any possibility of the rename hitting disk before the file contents.
>>>>
>>>> The fact that this could be implemented in multiple ways in the existing
>>>> hardware does not mean that there need to be multiple ways exposed to 
>>>> userspace,
>>>> it just means that the cost of doing the operation will vary depending on the
>>>> hardware that you have. This also means that if new hardware introduces a new
>>>> way of implementing this, that improvement can be passed on to the users 
>>>> without
>>>> needing application changes.
>>>
>>> There are a couple industry failures here:
>>>
>>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>>> because they don't know better. We programmers, who know better, have failed
>>> to raise a stink and demand that this be fixed.
>>>    A) Drives should not lose data on power failure. If a drive accepts a write
>>> request and says "OK, done" then that data should get written to stable
>>> storage, period. Whether it requires capacitors or some other onboard power
>>> supply, or whatever, they should just do it. Keep in mind that today, most of
>>> the difference between enterprise drives and consumer desktop drives is just a
>>> firmware change, that hardware is already identical. Nobody should accept a
>>> product that doesn't offer this guarantee. It's inexcusable.
>>>    B) it should go without saying - drives should reliably report back to the
>>> host, when something goes wrong. E.g., if a write request has been accepted,
>>> cached, and reported complete, but then during the actual write an ECC failure
>>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>>
>>> If the entire software industry were to simply state "your shit stinks and
>>> we're not going to take it any more" the hard drive industry would have no
>>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>>
>>> Once you have drives that are actually trustworthy, actually reliable (which
>>> doesn't mean they never fail, it only means they tell the truth about
>>> successes or failures), most of these other issues disappear. Most of the need
>>> for barriers disappear.
>>>
>>
>> I think that you are arguing a fairly silly point.
>
> Seems to me that you're arguing that we should accept inferior technology. 
> Who's really being silly?

No, just suggesting that you either pay for the expensive stuff or learn how to 
use cost-effective, high-capacity storage like the rest of the world.

I don't disagree that having non-volatile write caches would be nice, but 
everyone has learned how to deal with volatile write caches at the low end of 
the market.

>
>> If you want that behaviour, you have had it for more than a decade - simply
>> disable the write cache on your drive and you are done.
>
> You seem to believe it's nonsensical for someone to want both fast and 
> reliable writes, or that it's unreasonable for a storage device to offer the 
> same, cheaply. And yet it is clearly trivial to provide all of the above.

I look forward to seeing your products in the market.

Until you have more than "I want" and "I think" on your storage system design 
resume, I suggest you spend the money to get the parts with non-volatile write 
caches or fix your code.

Ric


>> If you - as a user - want to run faster and use applications that are coded to
>> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
>> enabled and use file system barriers.
>
> Applications aren't supposed to need to worry about such details, that's why 
> we have operating systems.
>
> Drives should tell the truth. In event of an error detected after the fact, 
> the drive should report the error back to the host. There's nothing 
> nonsensical there.
>
> When a drive's cache is enabled, the host should maintain a queue of written 
> pages, of a length equal to the size of the drive's cache. If a drive says 
> "hey, block XXX failed" the OS can reissue the write from its own queue. No 
> muss, no fuss, no performance bottlenecks. This is what Real Computers did 
> before the age of VAX Unix.
>
>> Everyone has to trade off cost versus something else and this is a very, very
>> long standing trade off that drive manufacturers have made.
>
> With the cost of storage falling as rapidly as it has in recent years, this is 
> a stupid tradeoff.
>


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-16 19:14                                                                 ` David Lang
  0 siblings, 0 replies; 154+ messages in thread
From: David Lang @ 2012-11-16 19:14 UTC (permalink / raw)
  To: Howard Chu
  Cc: General Discussion of SQLite Database, Vladislav Bolkhovitin,
	Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

On Fri, 16 Nov 2012, Howard Chu wrote:

> David Lang wrote:
>> barriers keep getting mentioned because they are a easy concept to 
>> understand.
>> "do this set of stuff before doing any of this other set of stuff, but I 
>> don't
>> care when any of this gets done" and they fit well with the requirements of 
>> the
>> users.
>> 
>> Users readily accept that if the system crashes, they will loose the most 
>> recent
>> stuff that they did,
>
> *some* users may accept that. *None* should.

when users are given a choice of having all their work be very slow, or having it 
be fast but, in the unlikely event of a crash, losing their most recent 
changes, they are willing to lose their most recent changes.

If you think about it, this is not much different from the fact that you lose 
all changes since the last time you saved the thing you are working on. Many 
programs save state periodically so that if the application crashes the user 
hasn't lost everything, but any application that tried to save after every 
single change would be so slow that nobody would use it.

There is always going to be a window after a user hits 'save' where the data can 
be lost, because it's not yet on disk.

> There are a couple industry failures here:
>
> 1) the drive manufacturers sell drives that lie, and consumers accept it 
> because they don't know better. We programmers, who know better, have failed 
> to raise a stink and demand that this be fixed.
>  A) Drives should not lose data on power failure. If a drive accepts a write 
> request and says "OK, done" then that data should get written to stable 
> storage, period. Whether it requires capacitors or some other onboard power 
> supply, or whatever, they should just do it. Keep in mind that today, most of 
> the difference between enterprise drives and consumer desktop drives is just 
> a firmware change, that hardware is already identical. Nobody should accept a 
> product that doesn't offer this guarantee. It's inexcusable.

This option is available to you. However, if you have enabled write caching and 
reordering, you have explicitly told the system to be faster at the expense of 
losing data under some conditions. The fact that you then lose data under 
those conditions should not surprise you.

The idea that you must have enough power to write all the pending data to disk 
is problematic as that then severely limits the amount of cache that you have.

>  B) it should go without saying - drives should reliably report back to the 
> host, when something goes wrong. E.g., if a write request has been accepted, 
> cached, and reported complete, but then during the actual write an ECC 
> failure is detected in the cacheline, the drive needs to tell the host "oh by 
> the way, block XXX didn't actually make it to disk like I told you it did 
> 10ms ago."

The issue isn't a drive having a write error; it's the system shutting down 
(or crashing) before the data is written. No OS-level tricks will help you here.


The real problem here isn't the drive claiming the data has been written when it 
hasn't; the real problem is that the application has said 'write this data' to 
the OS, and the OS has not done so yet.

The OS delays the writes for many legitimate reasons (the disk may be busy, it 
can get things done more efficiently by combining and reordering the writes, etc.)

Unless the system crashes, this is not a problem: the data will eventually be 
written out, and on system shutdown everything is good.

But if the system crashes, some of this postponed work doesn't get done, and 
that can be a problem.

Applications can do fsync if they want to be sure that their data is safe on 
disk NOW, but they currently have no way of saying "I want to make sure that A 
happens before B, but I don't care if A happens now or 10 seconds from now"

That is the gap that it would be useful to provide a mechanism to deal with, and 
it doesn't matter whether your disk system lies or not; there 
still isn't a way to deal with this today.
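
To make the gap concrete, this is roughly what the "write to temp file, sync, 
rename" dance from the config-file example looks like today, using only plain 
POSIX calls (error handling and the directory fsync trimmed for brevity). The 
only ordering tool available is the blocking fsync in the middle:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int save_config(const char *path, const char *buf, size_t len)
    {
        char tmp[4096];

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);                  /* "A": user sits and waits here */
            return -1;
        }
        close(fd);
        return rename(tmp, path);       /* "B": ordered after the data
                                           only because fsync blocked */
    }

A mechanism like the one described above would let that fsync step return 
immediately while still keeping the rename ordered behind the data.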

David Lang

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15 12:07                                                             ` David Lang
@ 2012-11-17  5:02                                                               ` Vladislav Bolkhovitin
  -1 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-17  5:02 UTC (permalink / raw)
  To: David Lang
  Cc: Nico Williams, General Discussion of SQLite Database,
	Theodore Ts'o, Richard Hipp, linux-kernel, linux-fsdevel

David Lang, on 11/15/2012 07:07 AM wrote:
>> There's no such thing as "barrier". It is fully artificial abstraction. After
>> all, at the bottom of your stack, you will have to translate it either to cache
>> flush, or commands order enforcement, or both.
>
> When people talk about barriers, they are talking about order enforcement.

Not correct. When people talk about barriers, they mean different things. For 
instance, Alan Cox, a few e-mails ago, meant a cache flush.

That's the problem with the barrier concept: barriers are ambiguous. There's no 
single barrier that can fit all requirements.

> the hardware capabilities are not directly accessible from userspace (and they
> probably shouldn't be)

The discussion is not about directly exposing storage hardware capabilities to 
the user space. The discussion is about replacing the wholly inadequate barrier 
abstraction with a set of other, adequate abstractions.

For instance:

1. Cache flush primitives:

1.1. FUA

1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile 
media

1.3. Immediate cache flush, i.e. return ASAP after the cache sync started, 
possibly before all data hit non-volatile media.

2. ORDERED attribute for requests. It provides the following behavior rules:

A.  All requests without this attribute can be executed in parallel and be freely 
reordered.

B. No ORDERED command can complete before every previously submitted command, 
ORDERED or not, has completed.

Those abstractions can naturally fit all storage capabilities. For instance:

  - On simple write-through (WT) cache hardware that does not support ordering 
commands, (1) translates to a NOP and (2) to queue draining.

  - On fully featured hardware, both (1) and (2) translate to the appropriate 
storage capabilities.

On FTL storage, (B) can be further optimized by doing the data transfers for 
ORDERED commands in parallel but committing them in the requested order.
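
As a minimal sketch of rule (B) only (assuming per-request sequence numbers, which 
are an illustration rather than part of the proposal), a completion-time check 
could look like this:

/* Illustrative sketch: rule (B) as a completion-time check.
 * An ORDERED request may complete only after every request
 * submitted before it (ORDERED or not) has completed.
 * Sequence numbers are assigned at submission time, starting at 1. */
struct request {
    unsigned long seq;              /* submission order, starting at 1 */
    int ordered;                    /* carries the ORDERED attribute   */
};

struct queue {
    unsigned long completed_up_to;  /* all requests with seq <= this are done */
};

static int may_complete(const struct queue *q, const struct request *rq)
{
    if (!rq->ordered)
        return 1;                               /* rule (A): complete any time */
    return q->completed_up_to >= rq->seq - 1;   /* rule (B): wait for earlier  */
}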

> barriers keep getting mentioned because they are an easy concept to understand.

Well, the concept of a flat Earth with the Sun rotating around it is also easy to 
understand. So why isn't it used?

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15 16:14                                                               ` 杨苏立 Yang Su Li
@ 2012-11-17  5:02                                                                   ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-17  5:02 UTC (permalink / raw)
  To: 杨苏立 Yang Su Li
  Cc: General Discussion of SQLite Database, Theodore Ts'o,
	Richard Hipp, linux-kernel, linux-fsdevel

杨苏立 Yang Su Li, on 11/15/2012 11:14 AM wrote:
> 1. fsync actually does two things at the same time: ordering writes (in a
> barrier-like manner), and forcing cached writes to disk. This makes it very
> difficult to implement fsync efficiently.

Exactly!

> However, logically they are two distinctive functionalities

Exactly!

Those two points are exactly why the concept of barriers must be forgotten, for the 
sake of productivity, and be replaced by finer-grained abstractions, and why barriers 
were removed from the Linux kernel.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-15 22:35                                                             ` [sqlite] " Chris Friesen
@ 2012-11-17  5:02                                                                 ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-17  5:02 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ryan Johnson, General Discussion of SQLite Database,
	Vladislav Bolkhovitin, Nico Williams, linux-fsdevel,
	Theodore Ts'o, linux-kernel, Richard Hipp


Chris Friesen, on 11/15/2012 05:35 PM wrote:
>> The easiest way to implement this fsync would involve three things:
>> 1. Schedule writes for all dirty pages in the fs cache that belong to
>> the affected file, wait for the device to report success, issue a cache
>> flush to the device (or request ordering commands, if available) to make
>> it tell the truth, and wait for the device to report success. AFAIK this
>> already happens, but without taking advantage of any request ordering
>> commands.
>> 2. The requesting thread returns as soon as the kernel has identified
>> all data that will be written back. This is new, but pretty similar to
>> what AIO already does.
>> 3. No write is allowed to enqueue any requests at the device that
>> involve the same file, until all outstanding fsync complete [3]. This is
>> new.
>
> This sounds interesting as a way to expose some useful semantics to userspace.
>
> I assume we'd need to come up with a new syscall or something since it doesn't
> match the behaviour of posix fsync().

This is how I would export the cache sync and request ordering abstractions to 
the user space:

For async IO (io_submit() and friends) I would extend struct iocb with flags that 
allow setting the required capabilities, i.e. whether this request is FUA, a full 
cache sync, immediate [1] or not, ORDERED or not, or any combination of these, per 
iocb.

For the regular read()/write() I would add one more flag to the "flags" parameter 
of sync_file_range(): whether this sync is immediate or not.

To enforce ordering rules I would add one more command to fcntl(). It would mark 
the latest submitted write on this fd as ORDERED.

Taken together, these should provide the requested functionality in a simple, 
effective, unambiguous and backward-compatible manner.
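
A minimal sketch of how this might look from user space. The names F_SET_ORDERED 
and SYNC_FILE_RANGE_IMMEDIATE below are hypothetical stand-ins for the proposed 
fcntl() command and sync_file_range() flag; neither exists in Linux, and the values 
are illustrative only:

/* Hedged sketch of the proposed interface (hypothetical names and values). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define F_SET_ORDERED             1024   /* hypothetical fcntl() command     */
#define SYNC_FILE_RANGE_IMMEDIATE 0x10   /* hypothetical "return early" flag */

int commit_record(int fd, const void *rec, size_t len, off_t off)
{
    if (pwrite(fd, rec, len, off) < 0)
        return -1;
    /* mark the write just submitted on this fd as ORDERED */
    if (fcntl(fd, F_SET_ORDERED) < 0)
        return -1;
    /* start a cache sync of the range and return as soon as it has begun */
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_IMMEDIATE);
}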

Vlad

1. See my other e-mail from today about what an immediate cache sync is.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-17  5:02                                                                 ` Vladislav Bolkhovitin
@ 2012-11-20  1:23                                                                   ` Vladislav Bolkhovitin
  -1 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-20  1:23 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Ryan Johnson, General Discussion of SQLite Database,
	Nico Williams, linux-fsdevel, Theodore Ts'o, linux-kernel,
	Richard Hipp

Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:
>>> The easiest way to implement this fsync would involve three things:
>>> 1. Schedule writes for all dirty pages in the fs cache that belong to
>>> the affected file, wait for the device to report success, issue a cache
>>> flush to the device (or request ordering commands, if available) to make
>>> it tell the truth, and wait for the device to report success. AFAIK this
>>> already happens, but without taking advantage of any request ordering
>>> commands.
>>> 2. The requesting thread returns as soon as the kernel has identified
>>> all data that will be written back. This is new, but pretty similar to
>>> what AIO already does.
>>> 3. No write is allowed to enqueue any requests at the device that
>>> involve the same file, until all outstanding fsync complete [3]. This is
>>> new.
>>
>> This sounds interesting as a way to expose some useful semantics to userspace.
>>
>> I assume we'd need to come up with a new syscall or something since it doesn't
>> match the behaviour of posix fsync().
>
> This is how I would export cache sync and requests ordering abstractions to the
> user space:
>
> For async IO (io_submit() and friends) I would extend struct iocb by flags, which
> would allow to set the required capabilities, i.e. if this request is FUA, or full
> cache sync, immediate [1] or not, ORDERED or not, or all at the same time, per
> each iocb.
>
> For the regular read()/write() I would add to "flags" parameter of
> sync_file_range() one more flag: if this sync is immediate or not.
>
> To enforce ordering rules I would add one more command to fcntl(). It would make
> the latest submitted write in this fd ORDERED.

Correction: to avoid possible races, it would be better for the new fcntl() command 
to mark the next N read()/write()/sync() calls on the fd as ORDERED.

For instance, in the simplest case of N=1, the single write() issued after the 
fcntl() would be handled as ORDERED.

(Unfortunately, it doesn't look like the old read()/write() interface has room 
for a more elegant solution.)
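
A minimal sketch of the corrected interface, again with a hypothetical command name 
(F_ORDER_NEXT) and value; the point is that the ordering is declared before the 
calls are issued, rather than tagging an already-submitted write:

/* Hedged sketch of the corrected interface (hypothetical names). */
#include <fcntl.h>
#include <unistd.h>

#define F_ORDER_NEXT 1025                /* hypothetical fcntl() command */

int write_ordered(int fd, const void *rec, size_t len, off_t off)
{
    if (fcntl(fd, F_ORDER_NEXT, 1) < 0)  /* N = 1: only the next call */
        return -1;
    return pwrite(fd, rec, len, off) < 0 ? -1 : 0;  /* handled as ORDERED */
}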

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
  2012-11-20  1:23                                                                   ` Vladislav Bolkhovitin
@ 2012-11-26 20:05                                                                     ` Nico Williams
  -1 siblings, 0 replies; 154+ messages in thread
From: Nico Williams @ 2012-11-26 20:05 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Chris Friesen, Ryan Johnson,
	General Discussion of SQLite Database, linux-fsdevel,
	Theodore Ts'o, linux-kernel, Richard Hipp

Vlad,

You keep saying that programmers don't understand "barriers".  You've
provided no evidence of this.  Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.

For some filesystems it is possible to configure fsync() to act as a
barrier: for example, ZFS can be told to perform no synchronous
operations for a given dataset, in which case fsync() devolves into a
simple barrier.  (Cue Simon to tell us that some hardware and some
OSes, and some filesystems simply cannot implement fsync(), with or
without synchronicity.)

So just give us a barrier.  Yes, I know, it's tricky to implement, but
it'd be OK to return EOPNOSUPP, and let the app do something else
(e.g., call fsync() instead, tell the user to expect instability, tell
the user to get a better system, ...).

As for implementation, it helps to have a journalled or log-structured
filesystem.  It also helps to have hardware synchronization primitives
that don't suck, but these aren't entirely necessary: ZFS, for
example, can recover [*] from N incomplete transactions[**], and still
provides fsync() as a barrier given its on-disk structure and the ZIL.
 Note that ZFS recovery from incomplete transactions should never be
necessary where the HW has proper cache flush support, but the
recovery functionality was added precisely because of lousy hardware.

[*]   At volume import time, such as at boot-time.
[**] Granted, this requires user input, but if the user didn't care it
could be made automatic.

Nico
--

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [sqlite] light weight write barriers
@ 2012-11-29  2:15                                                                       ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 154+ messages in thread
From: Vladislav Bolkhovitin @ 2012-11-29  2:15 UTC (permalink / raw)
  To: Nico Williams
  Cc: Chris Friesen, Ryan Johnson,
	General Discussion of SQLite Database, linux-fsdevel,
	Theodore Ts'o, linux-kernel, Richard Hipp


Nico Williams, on 11/26/2012 03:05 PM wrote:
> Vlad,
>
> You keep saying that programmers don't understand "barriers".  You've
> provided no evidence of this. Meanwhile memory barriers are generally
> well understood, and every programmer I know understands that a
> "barrier" is a synchronization primitive that says that all operations
> of a certain type will have completed prior to the barrier returning
> control to its caller.

Well, your understanding of memory barriers is wrong, and you are illustrating 
that the memory barrier concept is not so well understood in practice.

Simplifying, memory barrier instructions are not a "cache flush" of the local CPU, 
as is often thought. They set the order in which reads or writes from other CPUs 
become visible on this CPU, and nothing else. Locally, on each CPU, reads and 
writes are always seen in order. So, (1) on a single-CPU system memory barrier 
instructions don't make any sense, and (2) they should come in at least a pair, 
one on each CPU participating in the interaction; otherwise it's a clear sign of 
a mistake.
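
A minimal sketch of that pairing in portable C11 (release on the writing CPU must 
be matched by acquire on the reading CPU; neither side flushes any cache):

/* Sketch: barrier pairing with C11 atomics.  The writer's release must be
 * paired with the reader's acquire; a barrier on one side alone orders
 * nothing, and no cache is flushed by either. */
#include <stdatomic.h>

int data;
atomic_int ready;

void writer(void)
{
    data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release); /* "write barrier" side */
}

int reader(void)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) /* "read barrier" side */
        return data;   /* guaranteed to see 42 once ready is observed as 1 */
    return -1;
}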

There's nothing similar in storage, because storage has strong consistency 
requirements even if it is distributed. All those clouds and Hadoops with weak 
consistency requirements are outside this discussion, although even they don't 
have anything similar to memory barriers.

As I already wrote, the concept of a flat Earth with the Sun revolving around it 
is also very simple to understand. Are you still using this concept?

> So just give us a barrier.

As with the flat Earth, I'd strongly suggest that you start using an adequate 
concept of what you want to achieve, starting from what I proposed a few e-mails 
ago in this thread.

If you look at it, it offers exactly what you want, only named correctly.

Vlad

^ permalink raw reply	[flat|nested] 154+ messages in thread

end of thread, other threads:[~2012-11-29  2:15 UTC | newest]

Thread overview: 154+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <415E76CC-A53D-4643-88AB-3D7D7DC56F98@dubeyko.com>
2012-10-06 13:54 ` [PATCH 00/16] f2fs: introduce flash-friendly file system Vyacheslav Dubeyko
2012-10-06 20:06   ` Jaegeuk Kim
2012-10-07  7:09     ` Marco Stornelli
2012-10-07  9:31       ` Jaegeuk Kim
2012-10-07  9:31         ` Jaegeuk Kim
2012-10-07 12:08         ` Vyacheslav Dubeyko
2012-10-07 12:08           ` Vyacheslav Dubeyko
2012-10-08  8:25           ` Jaegeuk Kim
2012-10-08  8:25             ` Jaegeuk Kim
2012-10-08  9:59             ` Namjae Jeon
2012-10-08  9:59               ` Namjae Jeon
2012-10-08 10:52               ` Jaegeuk Kim
2012-10-08 11:21                 ` Namjae Jeon
2012-10-08 12:11                   ` Jaegeuk Kim
2012-10-09  3:52                     ` Namjae Jeon
2012-10-09  8:00                       ` Jaegeuk Kim
2012-10-09  8:31                 ` Lukáš Czerner
2012-10-09 10:45                   ` Jaegeuk Kim
2012-10-09 10:45                     ` Jaegeuk Kim
2012-10-09 11:01                     ` Lukáš Czerner
2012-10-09 12:01                       ` Jaegeuk Kim
2012-10-09 12:39                         ` Lukáš Czerner
2012-10-09 13:10                           ` Jaegeuk Kim
2012-10-09 21:20                         ` Dave Chinner
2012-10-09 21:20                           ` Dave Chinner
2012-10-10  2:32                           ` Jaegeuk Kim
2012-10-10  4:53                       ` Theodore Ts'o
2012-10-10  4:53                         ` Theodore Ts'o
2012-10-12 20:55                         ` Arnd Bergmann
2012-10-10 10:36                   ` David Woodhouse
2012-10-12 20:58                     ` Arnd Bergmann
2012-10-13  4:26                       ` Namjae Jeon
2012-10-13 12:37                         ` Jaegeuk Kim
2012-10-13 12:37                           ` Jaegeuk Kim
2012-10-17 11:12                           ` Namjae Jeon
     [not found]                             ` <000001cdacef$b2f6eaa0$18e4bfe0$%kim@samsung.com>
2012-10-18 13:39                               ` Vyacheslav Dubeyko
2012-10-18 22:14                                 ` Jaegeuk Kim
2012-10-19  9:20                                 ` NeilBrown
2012-10-08 19:22             ` Vyacheslav Dubeyko
2012-10-09  7:08               ` Jaegeuk Kim
2012-10-09  7:08                 ` Jaegeuk Kim
2012-10-09 19:53                 ` Jooyoung Hwang
2012-10-09 19:53                   ` Jooyoung Hwang
2012-10-10  8:05                   ` Vyacheslav Dubeyko
2012-10-10  9:02                   ` Theodore Ts'o
2012-10-10 11:52                     ` SQLite on flash (was: [PATCH 00/16] f2fs: introduce flash-friendly file system) Clemens Ladisch
     [not found]                       ` <50756199.1090103-P6GI/4k7KOmELgA04lAiVw@public.gmane.org>
2012-10-10 12:47                         ` Richard Hipp
2012-10-10 17:17                           ` light weight write barriers Andi Kleen
     [not found]                             ` <m2fw5mtffg.fsf_-_-Vw/NltI1exuRpAAqCnN02g@public.gmane.org>
2012-10-10 17:48                               ` Richard Hipp
2012-10-11 16:38                                 ` [sqlite] " Nico Williams
2012-10-11 16:38                                   ` Nico Williams
2012-10-11 16:48                                   ` [sqlite] " Nico Williams
2012-10-11 16:48                                     ` Nico Williams
2012-10-11 16:32                             ` [sqlite] " 杨苏立 Yang Su Li
2012-10-11 16:32                               ` 杨苏立 Yang Su Li
2012-10-11 17:41                               ` [sqlite] " Christoph Hellwig
2012-10-23 19:53                               ` Vladislav Bolkhovitin
2012-10-24 21:17                                 ` Nico Williams
2012-10-24 21:17                                   ` Nico Williams
2012-10-24 22:03                                   ` [sqlite] " david
2012-10-25  0:20                                     ` Nico Williams
2012-10-25  0:20                                       ` Nico Williams
2012-10-25  1:04                                       ` [sqlite] " david
2012-10-25  5:18                                         ` Nico Williams
2012-10-25  5:18                                           ` Nico Williams
2012-10-25  6:02                                           ` [sqlite] " Theodore Ts'o
2012-10-25  6:58                                             ` david
2012-10-25 14:03                                               ` Theodore Ts'o
2012-10-25 14:03                                                 ` Theodore Ts'o
2012-10-25 18:03                                                 ` [sqlite] " david
2012-10-25 18:03                                                   ` david-gFPdbfVZQbY
2012-10-25 18:29                                                   ` [sqlite] " Theodore Ts'o
2012-10-25 18:29                                                     ` Theodore Ts'o
2012-11-05 20:03                                                     ` [sqlite] " Pavel Machek
2012-11-05 20:03                                                       ` Pavel Machek
2012-11-05 22:04                                                       ` Theodore Ts'o
2012-11-05 22:04                                                         ` Theodore Ts'o
     [not found]                                                         ` <20121105220440.GB25378-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
2012-11-05 22:37                                                           ` Richard Hipp
2012-11-05 23:00                                                             ` [sqlite] " Theodore Ts'o
2012-11-05 23:00                                                               ` Theodore Ts'o
2012-10-30 23:49                                             ` [sqlite] " Nico Williams
2012-10-25  5:42                                     ` Theodore Ts'o
2012-10-25  7:11                                       ` david
2012-10-27  1:52                                   ` Vladislav Bolkhovitin
2012-10-25  5:14                                 ` Theodore Ts'o
2012-10-25 13:03                                   ` Alan Cox
2012-10-25 13:50                                     ` Theodore Ts'o
2012-10-25 13:50                                       ` Theodore Ts'o
2012-10-27  1:55                                       ` [sqlite] " Vladislav Bolkhovitin
2012-10-27  1:54                                   ` Vladislav Bolkhovitin
2012-10-27  4:44                                     ` Theodore Ts'o
2012-10-27  4:44                                       ` Theodore Ts'o
2012-10-30 22:22                                       ` [sqlite] " Vladislav Bolkhovitin
2012-10-31  9:54                                         ` Alan Cox
2012-10-31  9:54                                           ` Alan Cox
2012-11-01 20:18                                           ` [sqlite] " Vladislav Bolkhovitin
2012-11-01 21:24                                             ` Alan Cox
2012-11-01 21:24                                               ` Alan Cox
2012-11-02  0:15                                               ` [sqlite] " Vladislav Bolkhovitin
2012-11-02  0:38                                               ` Howard Chu
2012-11-02  0:38                                                 ` Howard Chu
     [not found]                                                 ` <50931601.4060102-aQkYFu9vm6AAvxtiuMwx3w@public.gmane.org>
2012-11-02 12:24                                                   ` Richard Hipp
2012-11-13  3:41                                                     ` [sqlite] " Vladislav Bolkhovitin
2012-11-02 12:33                                                 ` Alan Cox
2012-11-02 12:33                                                   ` Alan Cox
2012-11-13  3:41                                                   ` [sqlite] " Vladislav Bolkhovitin
2012-11-13  3:41                                                     ` Vladislav Bolkhovitin
2012-11-13 17:40                                                     ` Alan Cox
2012-11-13 17:40                                                       ` Alan Cox
2012-11-13 19:13                                                       ` [sqlite] " Nico Williams
2012-11-13 19:13                                                         ` Nico Williams
2012-11-15  1:17                                                         ` [sqlite] " Vladislav Bolkhovitin
2012-11-15 12:07                                                           ` David Lang
2012-11-15 12:07                                                             ` David Lang
     [not found]                                                             ` <alpine.DEB.2.02.1211150353080.32408-UEhY+ZBZOcqqLGM74eQ/YA@public.gmane.org>
2012-11-15 16:14                                                               ` 杨苏立 Yang Su Li
2012-11-17  5:02                                                                 ` [sqlite] " Vladislav Bolkhovitin
2012-11-17  5:02                                                                   ` Vladislav Bolkhovitin
2012-11-16 15:06                                                             ` Howard Chu
2012-11-16 15:06                                                               ` Howard Chu
2012-11-16 15:31                                                               ` [sqlite] " Ric Wheeler
2012-11-16 15:54                                                                 ` Howard Chu
2012-11-16 15:54                                                                   ` Howard Chu
2012-11-16 18:03                                                                   ` [sqlite] " Ric Wheeler
2012-11-16 18:03                                                                     ` Ric Wheeler
2012-11-16 19:14                                                               ` David Lang
2012-11-16 19:14                                                                 ` David Lang
2012-11-17  5:02                                                             ` [sqlite] " Vladislav Bolkhovitin
2012-11-17  5:02                                                               ` Vladislav Bolkhovitin
2012-11-15 17:06                                                           ` Ryan Johnson
2012-11-15 17:06                                                             ` Ryan Johnson
2012-11-15 22:35                                                             ` [sqlite] " Chris Friesen
2012-11-17  5:02                                                               ` Vladislav Bolkhovitin
2012-11-17  5:02                                                                 ` Vladislav Bolkhovitin
2012-11-20  1:23                                                                 ` Vladislav Bolkhovitin
2012-11-20  1:23                                                                   ` Vladislav Bolkhovitin
2012-11-26 20:05                                                                   ` Nico Williams
2012-11-26 20:05                                                                     ` Nico Williams
2012-11-29  2:15                                                                     ` Vladislav Bolkhovitin
2012-11-29  2:15                                                                       ` Vladislav Bolkhovitin
2012-11-15  1:16                                                       ` [sqlite] " Vladislav Bolkhovitin
2012-11-15  1:16                                                         ` Vladislav Bolkhovitin
2012-11-13  3:37                                                 ` Vladislav Bolkhovitin
     [not found]                                     ` <508B3EED.2080003-d+Crzxg7Rs0@public.gmane.org>
2012-11-11  4:25                                       ` 杨苏立 Yang Su Li
2012-11-13  3:42                                         ` [sqlite] " Vladislav Bolkhovitin
2012-11-13  3:42                                           ` Vladislav Bolkhovitin
2012-10-10  7:57                 ` [PATCH 00/16] f2fs: introduce flash-friendly file system Vyacheslav Dubeyko
2012-10-10  9:43                   ` Jaegeuk Kim
2012-10-11  3:14                     ` Namjae Jeon
     [not found]                       ` <CAN863PuyMkSZtZCvqX+kwei9v=rnbBYVYr3TqBXF_6uxwJe2_Q@mail.gmail.com>
2012-10-17 11:13                         ` Namjae Jeon
2012-10-17 23:06                           ` Changman Lee
2012-10-12 12:30                     ` Vyacheslav Dubeyko
2012-10-12 14:25                       ` Jaegeuk Kim
2012-10-07 10:15     ` Vyacheslav Dubeyko
2012-10-07 10:15       ` Vyacheslav Dubeyko
