* Linux Plumbers IO & File System Micro-conference
@ 2013-07-12 17:20 ` Ric Wheeler
  0 siblings, 0 replies; 14+ messages in thread
From: Ric Wheeler @ 2013-07-12 17:20 UTC (permalink / raw)
  To: linux-scsi, Linux FS Devel, linux-nfs, linux-btrfs, xfs-oss,
	linux-ext4, device-mapper development, IDE/ATA development list,
	anaconda-devel-list


Linux Plumbers has approved a file and storage microconference. The overview 
page is here:

http://wiki.linuxplumbersconf.org/2013:file_and_storage_systems

I would like to start gathering ideas for topics. I have already been 
approached with a request to cover the copy_range work Zach Brown kicked off 
at LSF, which I think is a great topic since it has impact throughout the 
stack, all the way up to application programmers.
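
To make the semantics concrete, here is a rough userspace sketch of what a 
copy_range() style operation has to do today when done naively; the point of 
the proposed work is to let the kernel (or the storage itself) perform the 
copy without bouncing the data through userspace buffers. The helper name 
and signature below are purely illustrative, not the proposed API:

/*
 * Hypothetical sketch only: a userspace model of copy_range() semantics.
 * The kernel-side proposal is about doing this copy without moving the
 * data through userspace at all.
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Naive fallback: copy 'len' bytes from fd_in@off_in to fd_out@off_out. */
static ssize_t copy_range(int fd_in, off_t off_in,
                          int fd_out, off_t off_out, size_t len)
{
        char buf[64 * 1024];
        size_t done = 0;

        while (done < len) {
                size_t chunk = len - done < sizeof(buf) ? len - done
                                                        : sizeof(buf);
                ssize_t n = pread(fd_in, buf, chunk, off_in + done);
                if (n <= 0)
                        return n < 0 ? -1 : (ssize_t)done;
                if (pwrite(fd_out, buf, n, off_out + done) != n)
                        return -1;
                done += n;
        }
        return done;
}

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
                return 1;
        }
        int in = open(argv[1], O_RDONLY);
        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0)
                return 1;
        off_t size = lseek(in, 0, SEEK_END);
        return copy_range(in, 0, out, 0, size) == size ? 0 : 1;
}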

A similar request was made to give some time to non-volatile memory devices 
and the proposed "persistent memory"-backed block device and persistent 
memory file system from Intel. That also fits in nicely with the Plumbers 
mission statement, since it will clearly impact everything from the 
application on down through the kernel.

Last year, we had a good discussion on storage management (anaconda, yast, 
libstoragemgmt, etc.) - that might also be worth an update.

The instructions for submitting ideas will be updated later today at:

http://www.linuxplumbersconf.org/2013/submitting-topic/

If you have topics that you would like to add, wait until the instructions get 
posted at the link above. If you are impatient, feel free to email me directly 
(but probably best to drop the broad mailing lists from the reply).

Thanks!

Ric


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: Linux Plumbers IO & File System Micro-conference
  2013-07-12 17:20 ` Ric Wheeler
@ 2013-07-12 17:42   ` faibish, sorin
  -1 siblings, 0 replies; 14+ messages in thread
From: faibish, sorin @ 2013-07-12 17:42 UTC (permalink / raw)
  To: Ric Wheeler, linux-scsi, Linux FS Devel, linux-nfs, linux-btrfs,
	xfs-oss, linux-ext4, device-mapper development,
	IDE/ATA development list, anaconda-devel-list

Can we have a discussion on the Lustre client in the kernel? Thanks

./Sorin

-----Original Message-----
From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Friday, July 12, 2013 1:21 PM
To: linux-scsi@vger.kernel.org; Linux FS Devel; linux-nfs@vger.kernel.org; linux-btrfs; xfs-oss; linux-ext4@vger.kernel.org; device-mapper development; IDE/ATA development list; anaconda-devel-list@redhat.com
Subject: Linux Plumbers IO & File System Micro-conference


Linux Plumbers has approved a file and storage microconference. The overview page is here:

http://wiki.linuxplumbersconf.org/2013:file_and_storage_systems

I would like to started gathering in ideas for topics. I have been approached already with a request to cover the copy_range work Zach Brown kicked off at LSF which I think is a great topic since it has impact throughout the stack, all the way up to applications programmers.

A similar request was made to give some time to non-volatile memory devices and the proposed "persistent memory" backed block device and persistent memory file system from Intel. That also seems to fall in nicely with the plumbers mission statement since it clearly will impact everything from the application on down through the kernel.

Last year, we had a good discussion on management of storage (anaconda, yast, libstoragemgt, etc) - that also might be worth giving updates on.

The instructions for submitting ideas will be updated later today at:

http://www.linuxplumbersconf.org/2013/submitting-topic/

If you have topics that you would like to add, wait until the instructions get posted at the link above. If you are impatient, feel free to email me directly (but probably best to drop the broad mailing lists from the reply).

Thanks!

Ric


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Linux Plumbers IO & File System Micro-conference
  2013-07-12 17:42   ` faibish, sorin
@ 2013-07-15 21:22     ` Ric Wheeler
  -1 siblings, 0 replies; 14+ messages in thread
From: Ric Wheeler @ 2013-07-15 21:22 UTC (permalink / raw)
  To: faibish, sorin
  Cc: linux-scsi, Linux FS Devel, linux-nfs, linux-btrfs, xfs-oss,
	linux-ext4, device-mapper development, IDE/ATA development list,
	anaconda-devel-list

On 07/12/2013 01:42 PM, faibish, sorin wrote:
> Can we have a discussion on Lustre client in the kernel? Thanks
>
> ./Sorin

I am not sure that we have that much to do for Lustre on the client side. Is 
this a topic that would be of broad enough interest to include people outside 
of the kernel development community?

thanks!

Ric

>
> -----Original Message-----
> From: linux-fsdevel-owner@vger.kernel.org [mailto:linux-fsdevel-owner@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Friday, July 12, 2013 1:21 PM
> To: linux-scsi@vger.kernel.org; Linux FS Devel; linux-nfs@vger.kernel.org; linux-btrfs; xfs-oss; linux-ext4@vger.kernel.org; device-mapper development; IDE/ATA development list; anaconda-devel-list@redhat.com
> Subject: Linux Plumbers IO & File System Micro-conference
>
>
> Linux Plumbers has approved a file and storage microconference. The overview page is here:
>
> http://wiki.linuxplumbersconf.org/2013:file_and_storage_systems
>
> I would like to started gathering in ideas for topics. I have been approached already with a request to cover the copy_range work Zach Brown kicked off at LSF which I think is a great topic since it has impact throughout the stack, all the way up to applications programmers.
>
> A similar request was made to give some time to non-volatile memory devices and the proposed "persistent memory" backed block device and persistent memory file system from Intel. That also seems to fall in nicely with the plumbers mission statement since it clearly will impact everything from the application on down through the kernel.
>
> Last year, we had a good discussion on management of storage (anaconda, yast, libstoragemgt, etc) - that also might be worth giving updates on.
>
> The instructions for submitting ideas will be updated later today at:
>
> http://www.linuxplumbersconf.org/2013/submitting-topic/
>
> If you have topics that you would like to add, wait until the instructions get posted at the link above. If you are impatient, feel free to email me directly (but probably best to drop the broad mailing lists from the reply).
>
> Thanks!
>
> Ric
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Linux Plumbers IO & File System Micro-conference
  2013-07-12 17:20 ` Ric Wheeler
  (?)
  (?)
@ 2013-07-19 19:52 ` Bernd Schubert
  2013-07-19 19:57   ` Ric Wheeler
  2013-07-22  0:47   ` Dave Chinner
  -1 siblings, 2 replies; 14+ messages in thread
From: Bernd Schubert @ 2013-07-19 19:52 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-mm, Linux FS Devel, Mel Gorman, Andreas Dilger, sage

Hello Ric, hi all,

On 07/12/2013 07:20 PM, Ric Wheeler wrote:
>
> If you have topics that you would like to add, wait until the
> instructions get posted at the link above. If you are impatient, feel
> free to email me directly (but probably best to drop the broad mailing
> lists from the reply).

Sorry, this is a rather long introduction; the short conclusion is below.


Introduction to the meta-cache issue:
=====================================
For quite a while we have been redesigning our FhGFS storage layout to 
work around meta-cache issues in the underlying file systems. However, 
there are constraints, as data and meta-data are distributed between 
several targets/servers. Other distributed file systems, such as Lustre 
and (I think) cephfs, should have similar issues.

The main issue we have is that streaming reads/writes evict meta-pages 
from the page-cache, which results in lots of directory-block reads when 
creating files. FhGFS, Lustre and (I believe) cephfs use hash directories 
to store object files. Access to files in these hash directories is 
rather random, and with an increasing number of files, access to the hash 
directory blocks/pages also gets entirely random. Streaming IO easily 
evicts these pages, which results in high latencies when users perform 
file creates/deletes, as the corresponding directory blocks have to be 
re-read from disk again and again.

Now one could argue that hash directories are a poor choice, and indeed 
we are mostly solving that issue in FhGFS now (currently stable release 
on the meta side, upcoming release on the data/storage side). However, 
given the problem of distributed meta-data and distributed data, we have 
not yet found a way to entirely eliminate hash directories. For example, 
one of our users recently created 80 million directories with one or two 
files each, and even with the new layout that would still be an issue. It 
is even an issue with direct access on the underlying file system. Of 
course, mostly empty directories should be avoided altogether, but users 
have their own way of doing IO.

Furthermore, the meta-cache vs. streaming-cache issue is not limited to 
directory blocks; any cached meta-data are affected. Mel recently wrote a 
few patches to improve meta-caching ("Obey mark_page_accessed hint given 
by filesystems"), but at least for our directory-block issue they don't 
seem to help.
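
To make the access pattern concrete, here is a minimal sketch of the 
general hash-directory scheme (illustrative only, not the actual FhGFS, 
Lustre or ceph layout): the object name is hashed, the hash selects one of 
the pre-created hash directories, and the object file lives there. Because 
the hash spreads names uniformly, creates and deletes touch the directory 
blocks in an essentially random order.

/* Illustrative sketch of a hash-directory layout, not any real file
 * system's on-disk scheme. */
#include <stdio.h>
#include <stdint.h>

#define NUM_HASH_DIRS 16384   /* matches the directory count used below */

/* Simple FNV-1a hash; real systems use their own hash functions. */
static uint32_t hash_name(const char *name)
{
        uint32_t h = 2166136261u;
        while (*name) {
                h ^= (unsigned char)*name++;
                h *= 16777619u;
        }
        return h;
}

/* Map an object name to its path inside the hash directories. */
static void object_path(const char *base, const char *name,
                        char *out, size_t outlen)
{
        unsigned dir = hash_name(name) % NUM_HASH_DIRS;
        snprintf(out, outlen, "%s/%04X/%s", base, dir, name);
}

int main(void)
{
        char path[4096];

        /* e.g. /data/chunks/1A2B/4-51F0E3A2-1 */
        object_path("/data/chunks", "4-51F0E3A2-1", path, sizeof(path));
        puts(path);
        return 0;
}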

Conclusion:
===========
From my point of view, there should be a small, but configurable, number 
of pages reserved for meta-data only. If streaming IO could not evict 
these pages, our and other file systems' meta-cache issues would probably 
be entirely solved.


Example:
========

Just a very basic bonnie++ test with 60000 files on ext4, with inlined 
data to reduce block and bitmap lookups and writes.

The hash directories (16384 of them) are entirely cached and populated 
with about 16 million files, so roughly 1000 files per hash-dir.

> Version  1.96       ------Sequential Create------ --------Random Create--------
> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>            60:32:32  1702  14  2025  12  1332   4  1873  16  2047  13  1266   3
> Latency              3874ms    6645ms    8659ms     505ms    7257ms    9627ms
> 1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms


Now after clients did some streaming IO:

> Version  1.96       ------Sequential Create------ --------Random Create--------
> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>            60:32:32   541   4  2343  16  2103   6   586   5  1947  13  1603   4
> Latency               190ms     166ms    3459ms    6762ms    6518ms    9185ms


With longer/more streaming, that can go down to 25 creates/s. iostat and 
btrace then show lots of meta-data reads, which correspond to 
directory-block reads.

Now after running 'find' over these hash directories to re-read all blocks:

> Version  1.96       ------Sequential Create------ --------Random Create--------
> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>            60:32:32  1878  16  2766  16  2464   7  1506  13  2054  13  1433   4
> Latency               349ms     164ms    1594ms    7730ms    6204ms    8112ms



Would a dedicated meta-cache be a topic for discussion?


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Linux Plumbers IO & File System Micro-conference
  2013-07-19 19:52 ` Bernd Schubert
@ 2013-07-19 19:57   ` Ric Wheeler
  2013-07-22  0:47   ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Ric Wheeler @ 2013-07-19 19:57 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-mm, Linux FS Devel, Mel Gorman, Andreas Dilger, sage

On 07/19/2013 03:52 PM, Bernd Schubert wrote:
> Hello Ric, hi all,
>
> On 07/12/2013 07:20 PM, Ric Wheeler wrote:
>>
>> If you have topics that you would like to add, wait until the
>> instructions get posted at the link above. If you are impatient, feel
>> free to email me directly (but probably best to drop the broad mailing
>> lists from the reply).
>
> sorry, that will be a rather long introduction, the short conclusion is below.
>
>
> Introduction to the meta-cache issue:
> =====================================
> For quite a while we are redesigning our FhGFS storage layout to workaround 
> meta-cache issues of underlying file systems. However, there are constraints 
> as data and meta-data are distributed on between several targets/servers. 
> Other distributed file systems, such as Lustre and (I think) cepfs should have 
> the similar issues.
>
> So the main issue we have is that streaming reads/writes evict meta-pages from 
> the page-cache. I.e. this results in lots of directory-block reads on creating 
> files. So FhGFS, Lustre an (I believe) cephfs are using hash-directories to 
> store object files. Access to files in these hash-directories is rather random 
> and with increasing number of files, access to hash directory-blocks/pages 
> also gets entirely random. Streaming IO easily evicts these pages, which 
> results in high latencies when users perform file creates/deletes, as 
> corresponding directory blocks have to be re-read from disk again and again.
> Now one could argue that hash-directories are poor choice and indeed we are 
> mostly solving that issue in FhGFS now(currently stable release on the meta 
> side, upcoming release on the data/storage side).
> However, given by the problem of distributed meta-data and distributed data we 
> have not found a way yet to entirely eliminate hash directories. For example, 
> recently one of our users created 80 million directories with one or two files 
> in these directories and even with the new layout that still would be an 
> issue. It even is an issue with direct access on the underlying file system. 
> Of course,  basically empty directories should be avoided at all, but users 
> have their own way of doing IO.
> Furthermore, the meta-cache vs. streaming-cache issue is not limited to 
> directory blocks only, but any cached meta-data are affected. Mel recently 
> wrote a few patches to improve meta-caching ("Obey mark_page_accessed hint 
> given by filesystems"), but at least for our directory-block issue that 
> doesn't seem to help.
>
> Conclusion:
> ===========
> From my point of view, there should be a small, but configurable, number pages 
> reserved for meta-data only. If streaming IO wouldn't be able evict these 
> pages, our and other file systems meta-cache issues probably would be entire 
> solved at all.
>
>
> Example:
> ========
>
> Just a very basic simple bonnie++ test with 60000 files on ext4 with inlined 
> data to reduce block and bitmap lookups and writes.
>
> Entirely cached hash directories (16384), which are populated with about 16 
> million files, so 1000 files per hash-dir.
>
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP
>>            60:32:32  1702  14  2025  12  1332   4  1873  16 2047  13  1266   3
>> Latency              3874ms    6645ms    8659ms     505ms 7257ms    9627ms
>> 1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms 
>>
>
>
> Now after clients did some streaming IO:
>
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP
>>            60:32:32   541   4  2343  16  2103   6   586   5 1947  13  1603   4
>> Latency               190ms     166ms    3459ms    6762ms 6518ms    9185ms
>
>
> With longer/more streaming that can go down to 25 creates/s. iostat and btrace 
> show lots of meta-reads then, which correspond to directory-block reads.
>
> Now after running 'find' over these hash directories to re-read all blocks:
>
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP
>>            60:32:32  1878  16  2766  16  2464   7  1506  13 2054  13  1433   4
>> Latency               349ms     164ms    1594ms    7730ms 6204ms    8112ms
>
>
>
> Would a dedicated meta-cache be a topic for discussion?
>
>
> Thanks,
> Bernd
>

Hi Bernd,

I think that sounds like an interesting idea to discuss - can you add a proposal 
here:

http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/proposals

Thanks!

Ric



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Linux Plumbers IO & File System Micro-conference
  2013-07-19 19:52 ` Bernd Schubert
  2013-07-19 19:57   ` Ric Wheeler
@ 2013-07-22  0:47   ` Dave Chinner
  2013-07-22 12:36       ` Bernd Schubert
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2013-07-22  0:47 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Ric Wheeler, linux-mm, Linux FS Devel, Mel Gorman, Andreas Dilger, sage

On Fri, Jul 19, 2013 at 09:52:00PM +0200, Bernd Schubert wrote:
> Hello Ric, hi all,
> 
> On 07/12/2013 07:20 PM, Ric Wheeler wrote:
> >
> >If you have topics that you would like to add, wait until the
> >instructions get posted at the link above. If you are impatient, feel
> >free to email me directly (but probably best to drop the broad mailing
> >lists from the reply).
> 
> sorry, that will be a rather long introduction, the short conclusion
> is below.
> 
> 
> Introduction to the meta-cache issue:
> =====================================
> For quite a while we are redesigning our FhGFS storage layout to
> workaround meta-cache issues of underlying file systems. However,
> there are constraints as data and meta-data are distributed on
> between several targets/servers. Other distributed file systems,
> such as Lustre and (I think) cepfs should have the similar issues.
> 
> So the main issue we have is that streaming reads/writes evict
> meta-pages from the page-cache. I.e. this results in lots of
> directory-block reads on creating files. So FhGFS, Lustre an (I
> believe) cephfs are using hash-directories to store object files.
> Access to files in these hash-directories is rather random and with
> increasing number of files, access to hash directory-blocks/pages
> also gets entirely random. Streaming IO easily evicts these pages,
> which results in high latencies when users perform file
> creates/deletes, as corresponding directory blocks have to be
> re-read from disk again and again.

Sounds like a filesystem-specific problem. Different filesystems
have different ways of caching metadata and respond differently to
page cache pressure.

For example, we changed XFS to have its own metadata buffer cache
reclaim mechanisms driven by a shrinker that uses prioritised cache
reclaim to ensure we reclaim less important metadata buffers before
ones that are more frequently hit (e.g. to reclaim tree leaves
before nodes and roots). This was done because the page-cache-based
reclaim of metadata was completely inadequate (i.e. mostly random!)
and would frequently reclaim the wrong thing and cause performance
under memory pressure to tank....
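
(For readers who haven't looked at that code: the mechanism is the generic 
shrinker interface - a subsystem with a private cache registers two 
callbacks, one that reports how many objects it could free and one that 
frees up to a requested number of them, least valuable first, under memory 
pressure. The skeleton below only shows the shape of that interface, using 
the count_objects/scan_objects form from kernels around 3.12-5.x; it is 
not the XFS implementation, and the registration call has changed again in 
newer kernels.)

/*
 * Schematic only - not XFS's reclaim code.  A module holding a private
 * cache of objects that it shrinks on memory pressure, least valuable
 * objects first.
 */
#include <linux/module.h>
#include <linux/shrinker.h>
#include <linux/atomic.h>

static atomic_long_t demo_nr_cached = ATOMIC_LONG_INIT(0);

/* How many objects could we free if asked? */
static unsigned long demo_count(struct shrinker *s, struct shrink_control *sc)
{
        return atomic_long_read(&demo_nr_cached);
}

/* Free up to sc->nr_to_scan objects. */
static unsigned long demo_scan(struct shrinker *s, struct shrink_control *sc)
{
        unsigned long freed = 0;

        while (freed < sc->nr_to_scan &&
               atomic_long_read(&demo_nr_cached) > 0) {
                /* A real cache would drop leaves before nodes/roots here. */
                atomic_long_dec(&demo_nr_cached);
                freed++;
        }
        return freed ? freed : SHRINK_STOP;
}

static struct shrinker demo_shrinker = {
        .count_objects = demo_count,
        .scan_objects  = demo_scan,
        .seeks         = DEFAULT_SEEKS,
};

static int __init demo_init(void)
{
        return register_shrinker(&demo_shrinker);
}

static void __exit demo_exit(void)
{
        unregister_shrinker(&demo_shrinker);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");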

> From my point of view, there should be a small, but configurable,
> number pages reserved for meta-data only. If streaming IO wouldn't
> be able evict these pages, our and other file systems meta-cache
> issues probably would be entire solved at all.

That's effectively what XFS does automatically - it doesn't reserve
pages, but it holds onto the frequently hit metadata buffers much,
much harder than any other Linux filesystem....

> Example:
> ========
> 
> Just a very basic simple bonnie++ test with 60000 files on ext4 with
> inlined data to reduce block and bitmap lookups and writes.
> 
> Entirely cached hash directories (16384), which are populated with
> about 16 million files, so 1000 files per hash-dir.
> 
> >Version  1.96       ------Sequential Create------ --------Random Create--------
> >fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> >files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> >           60:32:32  1702  14  2025  12  1332   4  1873  16  2047  13  1266   3
> >Latency              3874ms    6645ms    8659ms     505ms    7257ms    9627ms
> >1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms

Command line parameters, details of storage, the scripts you are
running, etc please. RAM as well, as 16 million files are going to
require at least 20GB RAM to fully cache...

Numbers without context or with "handwavy context" are meaningless
for the purpose of analysis and understanding.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Linux Plumbers IO & File System Micro-conference
  2013-07-22  0:47   ` Dave Chinner
@ 2013-07-22 12:36       ` Bernd Schubert
  0 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2013-07-22 12:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, linux-mm, Linux FS Devel, Mel Gorman, Andreas Dilger, sage

On 07/22/2013 02:47 AM, Dave Chinner wrote:
> On Fri, Jul 19, 2013 at 09:52:00PM +0200, Bernd Schubert wrote:
>> Hello Ric, hi all,
>>
>> On 07/12/2013 07:20 PM, Ric Wheeler wrote:
>>>
>>> If you have topics that you would like to add, wait until the
>>> instructions get posted at the link above. If you are impatient, feel
>>> free to email me directly (but probably best to drop the broad mailing
>>> lists from the reply).
>>
>> sorry, that will be a rather long introduction, the short conclusion
>> is below.
>>
>>
>> Introduction to the meta-cache issue:
>> =====================================
>> For quite a while we are redesigning our FhGFS storage layout to
>> workaround meta-cache issues of underlying file systems. However,
>> there are constraints as data and meta-data are distributed on
>> between several targets/servers. Other distributed file systems,
>> such as Lustre and (I think) cepfs should have the similar issues.
>>
>> So the main issue we have is that streaming reads/writes evict
>> meta-pages from the page-cache. I.e. this results in lots of
>> directory-block reads on creating files. So FhGFS, Lustre an (I
>> believe) cephfs are using hash-directories to store object files.
>> Access to files in these hash-directories is rather random and with
>> increasing number of files, access to hash directory-blocks/pages
>> also gets entirely random. Streaming IO easily evicts these pages,
>> which results in high latencies when users perform file
>> creates/deletes, as corresponding directory blocks have to be
>> re-read from disk again and again.
>
> Sounds like a filesystem specific problem. Different filesystems
> have different ways of caching metadata and respond differently to
> page cache pressure.
>
> For example, we changed XFS to have it's own metdata buffer cache
> reclaim mechanisms driven by a shrinker that uses prioritised cache
> reclaim to ensure we reclaim less important metadata buffers before
> ones that are more frequently hit (e.g. to reclaim tree leaves
> before nodes and roots). This was done because the page cache based
> reclaim of metadata was completely inadequate (i.e. mostly random!)
> and would frequently reclaim the wrong thing and cause performance
> under memory pressure to tank....

Well, especially with XFS I see reads all the time, and btrace tells me 
these are meta-data reads. So far I haven't found a way to make XFS cache 
meta-data permanently, and I haven't tracked that down any further.
For reference, and without the full bonnie output, with XFS I got about 
800 to 1000 creates/s.
That somewhat seems to confirm my idea not to let file systems try to 
handle it themselves, but to introduce a generic way to cache meta-data.

>
>>  From my point of view, there should be a small, but configurable,
>> number pages reserved for meta-data only. If streaming IO wouldn't
>> be able evict these pages, our and other file systems meta-cache
>> issues probably would be entire solved at all.
>
> That's effectively what XFS does automatically - it doesn't reserve
> pages, but it holds onto the frequently hit metadata buffers much,
> much harder than any other Linux filesystem....
>
>> Example:
>> ========
>>
>> Just a very basic simple bonnie++ test with 60000 files on ext4 with
>> inlined data to reduce block and bitmap lookups and writes.
>>
>> Entirely cached hash directories (16384), which are populated with
>> about 16 million files, so 1000 files per hash-dir.
>>
>>> Version  1.96       ------Sequential Create------ --------Random Create--------
>>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>            60:32:32  1702  14  2025  12  1332   4  1873  16  2047  13  1266   3
>>> Latency              3874ms    6645ms    8659ms     505ms    7257ms    9627ms
>>> 1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms
>
> Command line parameters, details of storage, the scripts you are
> running, etc please. RAM as well, as 16 million files are going to
> require at least 20GB RAM to fully cache...

The 16 million files are only lying around in the hash directories and 
are not touched at all when new files are created, so I don't know where 
you get 20GB from.
Our file names have a typical size of 21 bytes, so with a classical ext2 
layout that gives 29 bytes, and with alignment that makes 32 bytes per 
directory entry. Ignoring '.' and '..', we need 125000 x 4KiB directory 
blocks, so about 500MB plus some overhead.
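
A quick back-of-the-envelope check of that estimate, just re-doing the 
arithmetic above in code (the numbers are the ones from this thread):

/* Back-of-the-envelope check of the directory-block estimate above. */
#include <stdio.h>

int main(void)
{
        const long files        = 16 * 1000 * 1000; /* files in hash dirs */
        const long dirent_bytes = 32;   /* 21-byte name -> 32 bytes aligned */
        const long block_bytes  = 4096; /* 4 KiB directory blocks */

        long entries_per_block = block_bytes / dirent_bytes;       /* 128 */
        long blocks            = files / entries_per_block;     /* 125000 */
        long total_mib         = blocks * block_bytes / (1024 * 1024);

        printf("%ld entries/block, %ld blocks, ~%ld MiB of directory blocks\n",
               entries_per_block, blocks, total_mib);
        return 0;
}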

>
> Numbers without context or with "handwavy context" are meaningless
> for the purpose of analysis and understanding.

I just wanted to show here that creating new files introduces reads when 
meta-data have been evicted from the cache, and how easily that can 
happen. From my point of view the hardware does not matter much for that 
purpose.
This was with rotating disks, as typically used to store huge amounts of 
HPC data. With SSDs the effect would have been smaller, but even SSDs 
are not as fast as in-memory cache lookups.

Since you are asking: these are pretty old systems from 2006, with 8GB 
RAM and 10 rotating disks in an (md) RAID10.
Our customer systems usually have >=64GiB RAM and often _less_ than 16 
million files per server, but meta-data reads still impact latency and 
streaming performance.

The bonnie++ output above basically shows the parameters, but here is 
the full command for reference:
bonnie++ -s 0 -u 65535 -g 65535 -n60:32:32:1 -d /mnt/fhgfs/

Please note that bonnie++ is not ideally suited for meta-data benchmarks, 
but as I said above, I just wanted to demonstrate cache evictions.


Cheers,
Bernd




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Linux Plumbers IO & File System Micro-conference
  2013-07-22 12:36       ` Bernd Schubert
  (?)
@ 2013-07-23  6:25       ` Dave Chinner
  2013-07-26 14:35           ` Bernd Schubert
  -1 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2013-07-23  6:25 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Ric Wheeler, linux-mm, Linux FS Devel, Mel Gorman, Andreas Dilger, sage

On Mon, Jul 22, 2013 at 02:36:27PM +0200, Bernd Schubert wrote:
> On 07/22/2013 02:47 AM, Dave Chinner wrote:
> >On Fri, Jul 19, 2013 at 09:52:00PM +0200, Bernd Schubert wrote:
> >>Hello Ric, hi all,
> >>
> >>On 07/12/2013 07:20 PM, Ric Wheeler wrote:
> >>>
> >>>If you have topics that you would like to add, wait until the
> >>>instructions get posted at the link above. If you are impatient, feel
> >>>free to email me directly (but probably best to drop the broad mailing
> >>>lists from the reply).
> >>
> >>sorry, that will be a rather long introduction, the short conclusion
> >>is below.
> >>
> >>
> >>Introduction to the meta-cache issue:
> >>=====================================
> >>For quite a while we are redesigning our FhGFS storage layout to
> >>workaround meta-cache issues of underlying file systems. However,
> >>there are constraints as data and meta-data are distributed on
> >>between several targets/servers. Other distributed file systems,
> >>such as Lustre and (I think) cepfs should have the similar issues.
> >>
> >>So the main issue we have is that streaming reads/writes evict
> >>meta-pages from the page-cache. I.e. this results in lots of
> >>directory-block reads on creating files. So FhGFS, Lustre an (I
> >>believe) cephfs are using hash-directories to store object files.
> >>Access to files in these hash-directories is rather random and with
> >>increasing number of files, access to hash directory-blocks/pages
> >>also gets entirely random. Streaming IO easily evicts these pages,
> >>which results in high latencies when users perform file
> >>creates/deletes, as corresponding directory blocks have to be
> >>re-read from disk again and again.
> >
> >Sounds like a filesystem specific problem. Different filesystems
> >have different ways of caching metadata and respond differently to
> >page cache pressure.
> >
> >For example, we changed XFS to have it's own metdata buffer cache
> >reclaim mechanisms driven by a shrinker that uses prioritised cache
> >reclaim to ensure we reclaim less important metadata buffers before
> >ones that are more frequently hit (e.g. to reclaim tree leaves
> >before nodes and roots). This was done because the page cache based
> >reclaim of metadata was completely inadequate (i.e. mostly random!)
> >and would frequently reclaim the wrong thing and cause performance
> >under memory pressure to tank....
> 
> Well, especially with XFS I see reads all the time and btrace tells
> me these are meta-reads. So far I didn't find a way to make XFS to
> cache meta data permanenly and so far I didn't track that down any
> further.

Sure. That's what *I* want to confirm - what sort of metadata is
being read. And what I see is the inode and dentry caches getting
trashed, and that results in directory reads to repopulate the
dentry cache....

> For reference and without full bonnie output, with XFS I got about
> 800 to 1000 creates/s.
> Somewhat that seems to confirm my idea not to let file systems try
> to handle it themselves, but to introduce a generic way to cache
> meta data.

We already have generic metadata caches - the inode and dentry
caches.

The reason some filesystems have their own caches is that the
generic caches are not always suited to the physical metadata
structure of the filesystem, and hence they have their own
multi-level caches and reclaim implementations that are more optimal
than the generic cache mechanisms.

IOWs, there isn't an "optimal" generic metadata caching mechanism
that can be implemented.

> >>Entirely cached hash directories (16384), which are populated with
> >>about 16 million files, so 1000 files per hash-dir.
> >>
> >>>Version  1.96       ------Sequential Create------ --------Random Create--------
> >>>fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> >>>files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> >>>           60:32:32  1702  14  2025  12  1332   4  1873  16  2047  13  1266   3
> >>>Latency              3874ms    6645ms    8659ms     505ms    7257ms    9627ms
> >>>1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms
> >
> >Command line parameters, details of storage, the scripts you are
> >running, etc please. RAM as well, as 16 million files are going to
> >require at least 20GB RAM to fully cache...
> 
> 16 million files are only laying around in the hash directories and
> are not touched at all when new files are created. So I don't know
> where you take 20GB from.

Each inode in memory requires between 1-1.4k of memory depending on
the filesystem they belong to.  Then there's another ~200 bytes per
dentry per inode, and if the names are long enough, then another 64+
bytes for the name of the file held by the dentry.  So caching 16
million inodes (directory or files) requires 15-25GB of RAM to
cache.

FWIW, have you tried experimenting with
/proc/sys/vm/vfs_cache_pressure to change the ratio of metadata to
page cache reclaim? You might find that all you need to do is change
this ratio and your problem is solved.....
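
(For reference, the knob is just a sysctl: values below the default of 100 
make the kernel prefer to keep the dentry and inode caches over page 
cache. Something like "sysctl -w vm.vfs_cache_pressure=50", or the 
illustrative sketch below, is all that is needed.)

/* Read and (optionally) set vm.vfs_cache_pressure - equivalent to
 * cat/echo on /proc/sys/vm/vfs_cache_pressure.  Writing needs root.
 * Illustrative sketch only. */
#include <stdio.h>
#include <stdlib.h>

#define KNOB "/proc/sys/vm/vfs_cache_pressure"

int main(int argc, char **argv)
{
        FILE *f = fopen(KNOB, "r");
        int cur = -1;

        if (!f || fscanf(f, "%d", &cur) != 1) {
                perror(KNOB);
                return 1;
        }
        fclose(f);
        printf("current vfs_cache_pressure: %d (default 100)\n", cur);

        if (argc > 1) {                 /* e.g. ./vfs_pressure 50 */
                f = fopen(KNOB, "w");
                if (!f || fprintf(f, "%d\n", atoi(argv[1])) < 0) {
                        perror(KNOB);
                        return 1;
                }
                fclose(f);
        }
        return 0;
}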

> Our file names have a typical size of 21 bytes, so with a classical
> ext2 layout that gives 29 bytes, with alignment that makes 32 bytes
> per directory entry. Ignoring '.' and '..' we need 125000 x 4kiB
> directory blocks, so about 500MB + some overhead.

If the dentry cache stays populated, then how the filesystem lays
out dirents on disk is irrelevant - you won't ever be reading them
more than once....

> >Numbers without context or with "handwavy context" are meaningless
> >for the purpose of analysis and understanding.
> 
> I just wanted to show here, that creating new files introduces reads
> when meta-data have been evicted from the cache and how easily that
> can happen. From my point of view the hardware does not matter much
> for that purpose.

In my experience, hardware always matters when you are asking
someone else to understand and reproduce your performance
problem. It's often the single most critical aspect that we need to
understand....

> This was with rotating disks as typically used to store huge amounts
> of HPC data. With SSDs the effect would have been smaller, but even
> SSDs are not as fast as in-memory-cache lookups.
....
> Our customer systems usually have >=64GiB RAM and often _less_ than
> 16 million files per server. But still meta-reads impact latency and
> streaming performance.
.....
> Please not that bonnie++ is not ideally suitable for
> meta-benchmarks, but as I said above, I just wanted to demonstrate
> cache evictions.

Sure. On the other hand, you're describing a well known workload and
memory pressure eviction pattern that can be entirely prevented from
userspace.  Do you reuse any of the data that is streamed to disk
before it is evicted from memory by other streaming data? I suspect
the data cache hit rate for the workloads you are describing (HPC
and bulk data storage) is around 0%.

If so, why aren't you making use of fadvise(DONTNEED) to tell the
kernel it doesn't need to cache the data that is being
read/written? That will prevent streaming IO from creating memory
pressure, and that will prevent the hot data and metadata caches
from being trashed by cold streaming IO. I know of several large
scale/distributed storage server implementations that do exactly
this...
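
Roughly along these lines - a sketch only, with a made-up file name and 
sizes, and error handling trimmed; note that dirty data has to be flushed 
before DONTNEED will actually drop the pages:

#define _GNU_SOURCE                        /* for sync_file_range() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DROP_EVERY (64 * 1024 * 1024UL)    /* drop cache every 64 MiB */

int main(void)
{
        static char buf[1 << 20];          /* 1 MiB of dummy data */
        int fd = open("stream.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        off_t done = 0, last_drop = 0;

        if (fd < 0)
                return 1;
        memset(buf, 'x', sizeof(buf));

        for (int i = 0; i < 1024; i++) {   /* write 1 GiB */
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                        return 1;
                done += sizeof(buf);

                if (done - last_drop >= DROP_EVERY) {
                        /* Flush the written range, then tell the kernel we
                         * won't re-read it, so it can be reclaimed now
                         * instead of pushing out hot metadata later. */
                        sync_file_range(fd, last_drop, done - last_drop,
                                        SYNC_FILE_RANGE_WAIT_BEFORE |
                                        SYNC_FILE_RANGE_WRITE |
                                        SYNC_FILE_RANGE_WAIT_AFTER);
                        posix_fadvise(fd, last_drop, done - last_drop,
                                      POSIX_FADV_DONTNEED);
                        last_drop = done;
                }
        }
        close(fd);
        return 0;
}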

Remember: not all IO problems need to be solved by changing kernel
code ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Linux Plumbers IO & File System Micro-conference
  2013-07-23  6:25       ` Dave Chinner
@ 2013-07-26 14:35           ` Bernd Schubert
  0 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2013-07-26 14:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, linux-mm, Linux FS Devel, Mel Gorman, Andreas Dilger, sage

On 07/23/2013 08:25 AM, Dave Chinner wrote:
> On Mon, Jul 22, 2013 at 02:36:27PM +0200, Bernd Schubert wrote:
>> On 07/22/2013 02:47 AM, Dave Chinner wrote:
>>> On Fri, Jul 19, 2013 at 09:52:00PM +0200, Bernd Schubert wrote:
>>>> Hello Ric, hi all,
>>>>
>>>> On 07/12/2013 07:20 PM, Ric Wheeler wrote:
>>>>>

[...]

>>> For example, we changed XFS to have it's own metdata buffer cache
>>> reclaim mechanisms driven by a shrinker that uses prioritised cache
>>> reclaim to ensure we reclaim less important metadata buffers before
>>> ones that are more frequently hit (e.g. to reclaim tree leaves
>>> before nodes and roots). This was done because the page cache based
>>> reclaim of metadata was completely inadequate (i.e. mostly random!)
>>> and would frequently reclaim the wrong thing and cause performance
>>> under memory pressure to tank....
>>
>> Well, especially with XFS I see reads all the time and btrace tells
>> me these are meta-reads. So far I didn't find a way to make XFS to
>> cache meta data permanenly and so far I didn't track that down any
>> further.
>
> Sure. That's what *I* want to confirm - what sort of metadata is
> being read. And what I see is the inode and dentry caches getting
> trashed, and that results in directory reads to repopulate the
> dentry cache....
>
>> For reference and without full bonnie output, with XFS I got about
>> 800 to 1000 creates/s.
>> Somewhat that seems to confirm my idea not to let file systems try
>> to handle it themselves, but to introduce a generic way to cache
>> meta data.
>
> We already have generic metadata caches - the inode and dentry
> caches.
>
> The reason some filesystems have their own caches is that the
> generic caches are not always suited to the physical metadata
> structure of the filesystem, and hence they have their own
> multi-level caches and reclaim implementations that are more optimal
> than the generic cache mechanisms.
>
> IOWs, there isn't an "optimal" generic metadata caching mechanism
> that can be implemented.

Maybe just the generic framework should be improved?

>
>>>> Entirely cached hash directories (16384), which are populated with
>>>> about 16 million files, so 1000 files per hash-dir.
>>>>
>>>>> Version  1.96       ------Sequential Create------ --------Random Create--------
>>>>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>>>            60:32:32  1702  14  2025  12  1332   4  1873  16  2047  13  1266   3
>>>>> Latency              3874ms    6645ms    8659ms     505ms    7257ms    9627ms
>>>>> 1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms
>>>
>>> Command line parameters, details of storage, the scripts you are
>>> running, etc please. RAM as well, as 16 million files are going to
>>> require at least 20GB RAM to fully cache...
>>
>> 16 million files are only laying around in the hash directories and
>> are not touched at all when new files are created. So I don't know
>> where you take 20GB from.
>
> Each inode in memory requires between 1-1.4k of memory depending on
> the filesystem they belong to.  Then there's another ~200 bytes per
> dentry per inode, and if the names are long enough, then another 64+
> bytes for the name of the file held by the dentry.  So caching 16
> million inodes (directory or files) requires 15-25GB of RAM to
> cache.

Yes, but the 16 million files just lie around; I don't want to cache 
them. With ext4 it works fine to just cache the corresponding on-disk 
directory blocks. So when a new file is created, the file system can 
look up in these blocks that the file does not exist yet.

>
> FWIW, have you tried experimenting with
> /proc/sys/vm/vfs_cache_pressure to change the ratio of metadata to
> page cache reclaim? You might find that all you need to do is change
> this ratio and your problem is solved.....

Did you try that with kernels < 3.11? I did, and so did others; see for 
example https://nf.nci.org.au/training/talks/rjh.lug2011.pdf
In the past it did not help at all. However, and that is really good 
news, with 3.11 it eventually works, probably due to Mel's patches. 
Thanks Mel!

>
>> Our file names have a typical size of 21 bytes, so with a classical
>> ext2 layout that gives 29 bytes, with alignment that makes 32 bytes
>> per directory entry. Ignoring '.' and '..' we need 125000 x 4kiB
>> directory blocks, so about 500MB + some overhead.
>
> If the dentry cache stays populated, then how the filesystem lays
> out dirents on disk is irrelevant - you won't ever be reading them
> more than once....

How does the dentry cache help you with creates of new files? The 
dentry cache cannot know whether a file exists on disk or not. Or do you 
want to have a negative dentry cache of all possible file name 
combinations?

>
>>> Numbers without context or with "handwavy context" are meaningless
>>> for the purpose of analysis and understanding.
>>
>> I just wanted to show here, that creating new files introduces reads
>> when meta-data have been evicted from the cache and how easily that
>> can happen. From my point of view the hardware does not matter much
>> for that purpose.
>
> In my experience, hardware always matters when you are asking
> someone else to understand and reproduce your performance
> problem. It's often the single most critical aspect that we need to
> understand....
>
>> This was with rotating disks as typically used to store huge amounts
>> of HPC data. With SSDs the effect would have been smaller, but even
>> SSDs are not as fast as in-memory-cache lookups.
> ....
>> Our customer systems usually have >=64GiB RAM and often _less_ than
>> 16 million files per server. But still meta-reads impact latency and
>> streaming performance.
> .....
>> Please note that bonnie++ is not ideally suited for
>> metadata benchmarks, but as I said above, I just wanted to demonstrate
>> cache evictions.
>
> Sure. On the other hand, you're describing a well known workload and
> memory pressure eviction pattern that can be entirely prevented from
> userspace.  Do you reuse any of the data that is streamed to disk
> before it is evicted from memory by other streaming data? I suspect
> the data cache hit rate for the workloads you are describing (HPC
> and bulk data storage) is around 0%.
>
> If so, why aren't you making use of fadvise(DONTNEED) to tell the
> kernel it doesn't need to cache that data that is being
> read/written? That will prevent streaming IO from creating memory
> pressure, and that will prevent the hot data and metadata caches
> from being trashed by cold streaming IO. I know of several large
> scale/distributed storage server implementations that do exactly
> this...

I'm afraid it is not that easy. For example, we have several users
running OpenFoam over FhGFS, and while I really think someone should fix
OpenFoam's IO routines, OpenFoam as it stands has a cache hit rate of
99%. So because of this single program alone I cannot simply disable
caching on the FhGFS storage side, and there are many other examples
where caching helps. Also, since fadvise(DONTNEED) does not even notify
the file system, it does not help us to implement an RPC for well-behaved
applications - there would be no code path from which to issue that RPC.
I thought about writing a simple patch for that some time ago, but the
FhGFS client is not in the kernel and the servers are closed source, so
the chances of getting such a patch accepted without an in-kernel user
are almost zero.
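
(For completeness, the pattern you describe would look roughly like this
for a plain streaming writer - just a sketch; as explained above we
cannot apply it unconditionally on the FhGFS storage side:)

    /* write a chunk, push it to disk, then drop it from the page cache so
     * cold streaming data does not evict hot data and metadata */
    #include <fcntl.h>
    #include <unistd.h>

    int write_and_drop(int fd, const void *buf, size_t len, off_t off)
    {
            if (pwrite(fd, buf, len, off) != (ssize_t)len)
                    return -1;
            if (fdatasync(fd))      /* DONTNEED only drops clean pages */
                    return -1;
            return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
    }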

> Remember: not all IO problems need to be solved by changing kernel
> code ;)

Yes, sure - that is why I'm working on a different FhGFS storage layout
that allows better caching. But I still think it would be useful if file
systems could use a more suitable generic framework for caching their
metadata, and if admins had better control over it.


Cheers,
Bernd

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Linux Plumbers IO & File System Micro-conference
@ 2013-07-26 14:35           ` Bernd Schubert
  0 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2013-07-26 14:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, linux-mm, Linux FS Devel, Mel Gorman, Andreas Dilger, sage

On 07/23/2013 08:25 AM, Dave Chinner wrote:
> On Mon, Jul 22, 2013 at 02:36:27PM +0200, Bernd Schubert wrote:
>> On 07/22/2013 02:47 AM, Dave Chinner wrote:
>>> On Fri, Jul 19, 2013 at 09:52:00PM +0200, Bernd Schubert wrote:
>>>> Hello Ric, hi all,
>>>>
>>>> On 07/12/2013 07:20 PM, Ric Wheeler wrote:
>>>>>

[...]

>>> For example, we changed XFS to have its own metadata buffer cache
>>> reclaim mechanisms driven by a shrinker that uses prioritised cache
>>> reclaim to ensure we reclaim less important metadata buffers before
>>> ones that are more frequently hit (e.g. to reclaim tree leaves
>>> before nodes and roots). This was done because the page cache based
>>> reclaim of metadata was completely inadequate (i.e. mostly random!)
>>> and would frequently reclaim the wrong thing and cause performance
>>> under memory pressure to tank....
>>
>> Well, especially with XFS I see reads all the time, and btrace tells
>> me these are metadata reads. So far I haven't found a way to make XFS
>> cache metadata permanently, and I haven't tracked that down any
>> further yet.
>
> Sure. That's what *I* want to confirm - what sort of metadata is
> being read. And what I see is the inode and dentry caches getting
> trashed, and that results in directory reads to repopulate the
> dentry cache....
>
>> For reference, and without the full bonnie output, with XFS I got
>> about 800 to 1000 creates/s.
>> That somewhat confirms my idea not to let file systems try to handle
>> it themselves, but to introduce a generic way to cache metadata.
>
> We already have generic metadata caches - the inode and dentry
> caches.
>
> The reason some filesystems have their own caches is that the
> generic caches are not always suited to the physical metadata
> structure of the filesystem, and hence they have their own
> multi-level caches and reclaim implementations that are more optimal
> than the generic cache mechanisms.
>
> IOWs, there isn't an "optimal" generic metadata caching mechanism
> that can be implemented.

Maybe the generic framework itself should be improved, then?

>
>>>> Entirely cached hash directories (16384), which are populated with
>>>> about 16 million files, so 1000 files per hash-dir.
>>>>
>>>>> Version  1.96       ------Sequential Create------ --------Random Create--------
>>>>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>>>            60:32:32  1702  14  2025  12  1332   4  1873  16  2047  13  1266   3
>>>>> Latency              3874ms    6645ms    8659ms     505ms    7257ms    9627ms
>>>>> 1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms
>>>
>>> Command line parameters, details of storage, the scripts you are
>>> running, etc please. RAM as well, as 16 million files are going to
>>> require at least 20GB RAM to fully cache...
>>
>> 16 million files are only lying around in the hash directories and
>> are not touched at all when new files are created. So I don't know
>> where the 20GB comes from.
>
> Each inode in memory requires between 1k and 1.4k of memory depending
> on the filesystem it belongs to.  Then there's another ~200 bytes per
> dentry per inode, and if the names are long enough, then another 64+
> bytes for the name of the file held by the dentry.  So caching 16
> million inodes (directory or files) requires 15-25GB of RAM to
> cache.

Yes, but the 16 million files just lie around; I don't want to cache
them. With ext4 it works fine to cache only the corresponding on-disk
directory blocks, so when a new file is created the lookup in those
blocks is enough to tell that the name does not exist yet.

>
> FWIW, have you tried experimenting with
> /proc/sys/vm/vfs_cache_pressure to change the ratio of metadata to
> page cache reclaim? You might find that all you need to do is change
> this ratio and your problem is solved.....

Did you try that with kernels < 3.11? I did, and so did others - see for
example https://nf.nci.org.au/training/talks/rjh.lug2011.pdf
In the past it did not help at all. However, and that is really good
news, with 3.11 it finally works, probably due to Mel's patches. Thanks
Mel!

>
>> Our file names have a typical size of 21 bytes, so with a classical
>> ext2 directory layout that gives 29 bytes per entry, or 32 bytes with
>> alignment. Ignoring '.' and '..' we need 125000 x 4KiB directory
>> blocks, so about 500MB plus some overhead.
>
> If the dentry cache stays populated, then how the filesystem lays
> out dirents on disk is irrelevant - you won't ever be reading them
> more than once....

How does the dentry cache help with creates of new files? The dentry
cache cannot know whether a name already exists on disk. Or do you want
a negative dentry cache of all possible file name combinations?

>
>>> Numbers without context or with "handwavy context" are meaningless
>>> for the purpose of analysis and understanding.
>>
>> I just wanted to show here that creating new files introduces reads
>> once metadata has been evicted from the cache, and how easily that
>> can happen. From my point of view the hardware does not matter much
>> for that purpose.
>
> In my experience, hardware always matters when you are asking
> someone else to understand and reproduce your performance
> problem. It's often the single most critical aspect that we need to
> understand....
>
>> This was with rotating disks as typically used to store huge amounts
>> of HPC data. With SSDs the effect would have been smaller, but even
>> SSDs are not as fast as in-memory-cache lookups.
> ....
>> Our customer systems usually have >=64GiB RAM and often _less_ than
>> 16 million files per server. But still meta-reads impact latency and
>> streaming performance.
> .....
>> Please note that bonnie++ is not ideally suited for
>> metadata benchmarks, but as I said above, I just wanted to demonstrate
>> cache evictions.
>
> Sure. On the other hand, you're describing a well known workload and
> memory pressure eviction pattern that can be entirely prevented from
> userspace.  Do you reuse any of the data that is streamed to disk
> before it is evicted from memory by other streaming data? I suspect
> the data cache hit rate for the workloads you are describing (HPC
> and bulk data storage) is around 0%.
>
> If so, why aren't you making use of fadvise(DONTNEED) to tell the
> kernel it doesn't need to cache that data that is being
> read/written? That will prevent streaming IO from creating memory
> pressure, and that will prevent the hot data and metadata caches
> from being trashed by cold streaming IO. I know of several large
> scale/distributed storage server implementations that do exactly
> this...

I'm afraid it is not that easy. For example, we have several users
running OpenFoam over FhGFS, and while I really think someone should fix
OpenFoam's IO routines, OpenFoam as it stands has a cache hit rate of
99%. So because of this single program alone I cannot simply disable
caching on the FhGFS storage side, and there are many other examples
where caching helps. Also, since fadvise(DONTNEED) does not even notify
the file system, it does not help us to implement an RPC for well-behaved
applications - there would be no code path from which to issue that RPC.
I thought about writing a simple patch for that some time ago, but the
FhGFS client is not in the kernel and the servers are closed source, so
the chances of getting such a patch accepted without an in-kernel user
are almost zero.

> Remember: not all IO problems need to be solved by changing kernel
> code ;)

Yes, sure - that is why I'm working on a different FhGFS storage layout
that allows better caching. But I still think it would be useful if file
systems could use a more suitable generic framework for caching their
metadata, and if admins had better control over it.


Cheers,
Bernd

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2013-07-26 14:35 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-12 17:20 Linux Plumbers IO & File System Micro-conference Ric Wheeler
2013-07-12 17:20 ` Ric Wheeler
2013-07-12 17:42 ` faibish, sorin
2013-07-12 17:42   ` faibish, sorin
2013-07-15 21:22   ` Ric Wheeler
2013-07-15 21:22     ` Ric Wheeler
2013-07-19 19:52 ` Bernd Schubert
2013-07-19 19:57   ` Ric Wheeler
2013-07-22  0:47   ` Dave Chinner
2013-07-22 12:36     ` Bernd Schubert
2013-07-22 12:36       ` Bernd Schubert
2013-07-23  6:25       ` Dave Chinner
2013-07-26 14:35         ` Bernd Schubert
2013-07-26 14:35           ` Bernd Schubert
