* [LSF/MM TOPIC] [ATTEND] Future writeback topics
@ 2012-01-22 13:50 ` Boaz Harrosh
  0 siblings, 0 replies; 17+ messages in thread
From: Boaz Harrosh @ 2012-01-22 13:50 UTC (permalink / raw)
  To: lsf-pc, linux-scsi, linux-fsdevel
  Cc: Jan Kara, Andrea Arcangeli, Wu Fengguang, Martin K. Petersen,
	Dave Chinner

Hi

Now that we have the "IO-less dirty throttling" work in and kicking (ass, I might say),
are there plans for a second stage? I can see a few areas that need some love.

[IO Fairness, time sorted writeback, properly delayed writeback]

  Following on from what we started to discuss in another thread ("[LSF/MM TOPIC] a few storage topics"),
  I would like to propose the following topics:

* Do we have enough information about when pages were dirtied, such as the
  IO elevator's information, readily available for use at the VFS layer?
* BDI writeout should be smarter than a round-robin cycle over the SBs /
  inodes of a BDI. It should be time based, writing the oldest data first.
  (Take the lowest-indexed page of an inode as the dirty time of the inode;
   maybe also keep an oldest-modified inode per SB of a BDI.)

  This can address IO fairness and bound the latency (interactiveness) of small
  IOs.
  There might be other solutions to this problem; any ideas?

* Introduce an "aging time" factor of an inode which will postpone the writeout
  of an inode to the next writeback timer if the inode has "just changed".

  This can solve the problem of an application doing heavy modification of some
  area of a file and the writeback timer sampling that change too soon, forcing
  pages to change during IO, as well as split IO, where waiting for the next
  cycle could have captured the complete modification in a single submit.
  (A rough sketch of both of these ideas follows below.)
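
  A very rough sketch of the selection logic these two items describe (standalone
  C; all structures and names here are made up for illustration, not existing
  kernel API): pick the inode whose data has been dirty the longest across all
  SBs of the BDI, but postpone inodes modified within the last "aging time".

/* Hypothetical types; not the in-kernel struct inode/backing_dev_info. */
struct dirty_inode {
	unsigned long dirtied_when;	/* when its oldest page was dirtied  */
	unsigned long last_modified;	/* when it was most recently dirtied */
	struct dirty_inode *next;
};

struct bdi_sb {				/* one superblock on this BDI */
	struct dirty_inode *dirty_list;
	struct bdi_sb *next;
};

#define AGING_TIME 100UL		/* "just changed" threshold, in ticks */

/*
 * Return the inode whose data has been dirty the longest, skipping
 * inodes that changed within the last AGING_TIME ticks so they get
 * picked up by the next writeback cycle instead.
 */
static struct dirty_inode *pick_oldest(struct bdi_sb *sbs, unsigned long now)
{
	struct bdi_sb *sb;
	struct dirty_inode *i, *best = NULL;

	for (sb = sbs; sb; sb = sb->next)
		for (i = sb->dirty_list; i; i = i->next) {
			if (now - i->last_modified < AGING_TIME)
				continue;	/* still hot, wait a cycle */
			/* real kernel code would compare with time_before() */
			if (!best || i->dirtied_when < best->dirtied_when)
				best = i;
		}
	return best;
}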


[Targeted writeback (IO-less page-reclaim)]
  Sometimes we need to write out a certain page or group of pages. It would be
  nice to prioritize/start the writeback of these pages through the regular writeback
  mechanism instead of doing direct IO like today.

  This is actually related to the above, where we can have a "write_now" time constant that
  gives that inode the priority to be written first. Then we also need the page info
  for the pages we want written as part of that inode's IO. Usually today we start at the lowest
  indexed page of the inode, right? In targeted writeback we should make sure the writeout
  is the longest contiguous (aligned) dirty region containing the targeted page.

  With this in place we can also move to an IO-less page-reclaim that is done entirely by
  the BDI writeback thread. (Need I say more?)
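
  As a rough, standalone illustration of "the longest contiguous (aligned) dirty
  region containing the targeted page" (a sketch over a plain dirty bitmap; a
  real implementation would walk the page-cache radix tree and its dirty tags):

#include <stdbool.h>
#include <stddef.h>

/*
 * Expand outwards from the target page to the longest run of contiguous
 * dirty pages, then pad the range out to an alignment boundary.  The
 * padding may pull clean pages into the range; a real implementation
 * would decide whether to write those too or clamp back to the dirty run.
 */
static void target_region(const bool *dirty, size_t npages, size_t target,
			  size_t align, size_t *start, size_t *end)
{
	size_t s = target, e = target;

	while (s > 0 && dirty[s - 1])
		s--;
	while (e + 1 < npages && dirty[e + 1])
		e++;

	if (align) {
		s -= s % align;				/* round start down */
		if ((e + 1) % align)			/* round end up     */
			e += align - ((e + 1) % align);
		if (e >= npages)
			e = npages - 1;
	}

	*start = s;
	*end = e;
}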

[Aligned IO]

  Each BDI should have a way to specify its alignment preferences and optimal IO sizes,
  and the VFS writeout can take those into consideration when submitting IO.

  This can both reduce lots of work done in individual filesystems and benefit
  lots of other filesystems that did not take care of this. It can also make the life of
  some of the FSs that do care a lot easier, producing IO patterns that are much better
  than what can be achieved today with the FS trying to second-guess the VFS.
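
  Something along these lines, as a sketch only (the io_align/io_opt fields are
  hypothetical; struct backing_dev_info does not carry them today):

/* Hypothetical per-BDI IO geometry hints, filled in at registration time. */
struct bdi_io_hints {
	unsigned int io_align;		/* preferred alignment, in pages */
	unsigned int io_opt;		/* optimal IO size, in pages     */
};

/*
 * Shape a candidate writeback chunk [*index, *index + *nr) so that it
 * starts on an alignment boundary and does not exceed the optimal size.
 */
static void shape_chunk(const struct bdi_io_hints *h,
			unsigned long *index, unsigned long *nr)
{
	if (h->io_align) {
		unsigned long skew = *index % h->io_align;

		*index -= skew;		/* start on a boundary */
		*nr += skew;		/* keep the same end   */
	}
	if (h->io_opt && *nr > h->io_opt)
		*nr = h->io_opt;	/* cap at the optimum  */
}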

[IO less sync]

  This topic is actually related to the above Aligned IO. 

  In today's code, in a regular write pattern, when an application is writing a long
  enough file, we have two sources of threads for the .writepages vector: one is the
  BDI writeback thread, the other is the sync operation. This produces nightmarish IO
  patterns when write_cache_pages() is re-entrant and each instance is fighting the
  other in grabbing random pages. This is bad for two reasons:
   1. It makes each instance grab a non-contiguous set of pages, which causes the IO
      to split and be non-aligned.
   2. It causes seeky IO where otherwise the application just wrote linear IO of
      a large file and then a sync.

  The IO pattern is so bad that in some cases it is better to serialize the calls to
  write_cache_pages() to avoid it, even with the cost of a mutex at every call.

  Would it be hard to have "sync" set some info, raise a flag, fire up the writeback,
  and wait for it to finish? Writeback, in its turn, should switch to a sync mode on that
  inode. (The sync operation need not change the writeback priority, in my opinion, like
  it does today.)
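
  In other words, something like the following control flow (a sketch only; none
  of these names are existing kernel APIs, and real code would sleep on a
  completion instead of spinning):

#include <stdbool.h>

/* Hypothetical per-inode sync request handed to the BDI flusher thread. */
struct sync_request {
	unsigned long ino;
	bool done;		/* set by the flusher when writeout finishes */
};

/* Stub: in reality this would queue work for the BDI flusher thread,
 * which writes the inode out in sync mode and then sets req->done. */
static void bdi_queue_sync_work(struct sync_request *req)
{
	req->done = true;	/* stubbed out for the sketch */
}

/*
 * "IO-less" sync path: the syncing task never calls write_cache_pages()
 * itself, so it cannot compete with the flusher for pages of the inode.
 */
static int ioless_sync_inode(unsigned long ino)
{
	struct sync_request req = { .ino = ino, .done = false };

	bdi_queue_sync_work(&req);
	while (!req.done)
		;		/* wait for the flusher to finish */
	return 0;
}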

Thanks
Boaz


* Re: [LSF/MM TOPIC] [ATTEND] Future writeback topics
  2012-01-22 13:50 ` Boaz Harrosh
@ 2012-01-22 14:49 ` James Bottomley
  2012-01-22 15:37     ` Boaz Harrosh
  -1 siblings, 1 reply; 17+ messages in thread
From: James Bottomley @ 2012-01-22 14:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: lsf-pc, linux-scsi, linux-fsdevel, Jan Kara, Andrea Arcangeli,
	Wu Fengguang, Martin K. Petersen, Dave Chinner, llinux-mm

Since a lot of these are mm-related, I've added linux-mm to the cc list.

On Sun, 2012-01-22 at 15:50 +0200, Boaz Harrosh wrote:
> Hi
> 
> Now that we have the "IO-less dirty throttling" in and kicking (ass I might say)
> Are there plans for second stage? I can see few areas that need some love.
> 
> [IO Fairness, time sorted writeback, properly delayed writeback]
> 
>   As we started to talk about in another thread: "[LSF/MM TOPIC] a few storage topics"
>   I would like to propose the following topics:
> 
> * Do we have enough information for the time of dirty of pages, such as the
>   IO-elevators information, readily available to be used at the VFS layer.
> * BDI writeout should be smarter then a round robin cycle of SBs per BDI /
>   inodes. It should be time based, writing the oldest data first.
>   (Take the lowest indexed page of an inode as the dirty time of the inode.
>    maybe also keep an oldest modified inode per-SB of a BDI)
> 
>   This can solve the IO fairness and latency bound (interactivness) of small
>   IOs.
>   There might be other solutions to this problem, any Ideas?
> 
> * Introduce an "aging time" factor of an inode which will postpone the writeout
>   of an inode to the next writeback timer if the inode has "just changed".
> 
>   This can solve the problem of an application doing heavy modification of some
>   area of a file and the writeback timer sampling that change too soon and forcing
>   pages to change during IO, as well as having split IO where waiting for the next
>   cycle could have the complete modification in a singe submit.
> 
> 
> [Targeted writeback (IO-less page-reclaim)]
>   Sometimes we would need to write a certain page or group of pages. It could be
>   nice to prioritize/start the writeback on these pages, through the regular writeback
>   mechanism instead of doing direct IO like today.
> 
>   This is actually related to above where we can have a "write_now" time constant that
>   makes the priority of that inode to be written first. Then we also need the page-info
>   that we want to write as part of that inode's IO. Usually today we start at the lowest
>   indexed page of the inode, right? In targeted writeback we should make sure the writeout
>   is the longest contiguous (aligned) dirty region containing the targeted page.
> 
>   With this in place we can also move to an IO-less page-reclaim. that is done entirely by
>   the BDI thread writeback. (Need I say more)

All of the above are complex.  The only reason for adding complexity to
our writeback path should be that we can demonstrate it's actually
needed.  In order to demonstrate this, you'd need performance
measurements ... is there a plan to get these before the summit?

> [Aligned IO]
> 
>   Each BDI should have a way to specify it's Alignment preferences and optimum IO sizes
>   and the VFS writeout can take that into consideration when submitting IO.
> 
>   This can both reduce lots of work done at individual filesystems, as well as benefit
>   lots of other filesystems that did not take care of this. It can also make the life of
>   some of the FSs that do care, a lot easier. Producing IO patterns that are much better
>   then what can be achieved today with the FS trying to second guess the VFS.

Since a bdi is coupled to a gendisk and a queue, why isn't
optimal_io_size what you want?
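
For reference, the block layer already exports those limits; a block-backed BDI
registration could pull them out roughly like this (bdev_io_min()/bdev_io_opt()
are existing block layer helpers, the hint structure itself is hypothetical):

#include <linux/blkdev.h>

/* Hypothetical place to keep the hints at the BDI level. */
struct bdi_io_hints {
	unsigned int io_min;	/* minimum preferred IO size, in bytes */
	unsigned int io_opt;	/* optimal IO size, in bytes           */
};

static void bdi_hints_from_bdev(struct block_device *bdev,
				struct bdi_io_hints *h)
{
	h->io_min = bdev_io_min(bdev);	/* e.g. RAID chunk size   */
	h->io_opt = bdev_io_opt(bdev);	/* e.g. full stripe width */
}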

> [IO less sync]
> 
>   This topic is actually related to the above Aligned IO. 
> 
>   In today's code, in a regular write pattern, when an application is writing a long
>   enough file, we have two sources of threads for the .write_pages vector. One is the
>   BDI write_back thread, the other is the sync operation. This produces nightmarish IO
>   patterns when the write_cache_pages() is re-entrant and each instance is fighting the
>   other in garbing random pages, this is bad because of two reasons:
>    1. makes each instance grab a none contiguous set of pages which causes the IO
>       to split and be none-aligned.
>    2. Causes Seeky IO where otherwise the application just wrote linear IO of
>       a large file and then sync.
> 
>   The IO pattern is so bad that in some cases it is better to serialize the call to
>   write_cache_pages() to avoid it. Even with the cost of a Mutex at every call
> 
>   Would it be hard to have "sync" set some info, raise a flag, fire up the writeback
>   and wait for it to finish? writeback in it's turn should switch to a sync mode on that
>   inode. (The sync operation need not change the writeback priority in my opinion like
>   today)

This is essentially what we've been discussing in "Fixing Writeback" for
the last two years, isn't it (the fact that we have multiple sources of
writeback and they don't co-ordinate properly)?  I thought our solution
was to prefer linear over seeky ... adding a mutex makes that more
absolute than a preference, but are you sure it helps (especially as it
adds a lock to the writeout path)?

James




* Re: [LSF/MM TOPIC] [ATTEND] Future writeback topics
  2012-01-22 14:49 ` James Bottomley
@ 2012-01-22 15:37     ` Boaz Harrosh
  0 siblings, 0 replies; 17+ messages in thread
From: Boaz Harrosh @ 2012-01-22 15:37 UTC (permalink / raw)
  To: James Bottomley
  Cc: lsf-pc, linux-scsi, linux-fsdevel, Jan Kara, Andrea Arcangeli,
	Wu Fengguang, Martin K. Petersen, Dave Chinner, llinux-mm

On 01/22/2012 04:49 PM, James Bottomley wrote:
> Since a lot of these are mm related; added linux-mm to cc list
> 

Hi James.

Thanks for reading, and sorry I missed linux-mm

> On Sun, 2012-01-22 at 15:50 +0200, Boaz Harrosh wrote:
>> Hi
>>
>> Now that we have the "IO-less dirty throttling" in and kicking (ass I might say)
>> Are there plans for second stage? I can see few areas that need some love.
>>
>> [IO Fairness, time sorted writeback, properly delayed writeback]
>>
>>   As we started to talk about in another thread: "[LSF/MM TOPIC] a few storage topics"
>>   I would like to propose the following topics:
>>
>> * Do we have enough information for the time of dirty of pages, such as the
>>   IO-elevators information, readily available to be used at the VFS layer.
>> * BDI writeout should be smarter then a round robin cycle of SBs per BDI /
>>   inodes. It should be time based, writing the oldest data first.
>>   (Take the lowest indexed page of an inode as the dirty time of the inode.
>>    maybe also keep an oldest modified inode per-SB of a BDI)
>>
>>   This can solve the IO fairness and latency bound (interactivness) of small
>>   IOs.
>>   There might be other solutions to this problem, any Ideas?
>>
>> * Introduce an "aging time" factor of an inode which will postpone the writeout
>>   of an inode to the next writeback timer if the inode has "just changed".
>>
>>   This can solve the problem of an application doing heavy modification of some
>>   area of a file and the writeback timer sampling that change too soon and forcing
>>   pages to change during IO, as well as having split IO where waiting for the next
>>   cycle could have the complete modification in a singe submit.
>>
>>
>> [Targeted writeback (IO-less page-reclaim)]
>>   Sometimes we would need to write a certain page or group of pages. It could be
>>   nice to prioritize/start the writeback on these pages, through the regular writeback
>>   mechanism instead of doing direct IO like today.
>>
>>   This is actually related to above where we can have a "write_now" time constant that
>>   makes the priority of that inode to be written first. Then we also need the page-info
>>   that we want to write as part of that inode's IO. Usually today we start at the lowest
>>   indexed page of the inode, right? In targeted writeback we should make sure the writeout
>>   is the longest contiguous (aligned) dirty region containing the targeted page.
>>
>>   With this in place we can also move to an IO-less page-reclaim. that is done entirely by
>>   the BDI thread writeback. (Need I say more)
> 
> All of the above are complex.  The only reason for adding complexity in
> our writeback path should be because we can demonstrate that it's
> actually needed.  In order to demonstrate this, you'd need performance
> measurements ... is there a plan to get these before the summit?
> 

Some measurements have already been done and complained about. There were even attempts
at IO-less page-reclaim by Dave Chinner, if I recall correctly. Mainly the complaints I'm
addressing here are:
 1. The very bad IO patterns of page-reclaim, and its avoidance.
 2. The issue raised in that other thread about the penalty of pages changing during IO.
 3. The obliviousness of the VFS writeback to fairness, and the starvation of small IOs
    in filesystems that are not block based.

But I agree much more testing is needed, especially for 3. I can't promise I'll be up to it
for LSF.

Even more blasphemous of me is that I'm not the one who could code such changes;
I'm not familiar enough with the VFS code to take on such a task. I only know that, as a
filesystem, these are areas that are missed.

>> [Aligned IO]
>>
>>   Each BDI should have a way to specify it's Alignment preferences and optimum IO sizes
>>   and the VFS writeout can take that into consideration when submitting IO.
>>
>>   This can both reduce lots of work done at individual filesystems, as well as benefit
>>   lots of other filesystems that did not take care of this. It can also make the life of
>>   some of the FSs that do care, a lot easier. Producing IO patterns that are much better
>>   then what can be achieved today with the FS trying to second guess the VFS.
> 
> Since a bdi is coupled to a gendisk and a queue, why isn't
> optimal_io_size what you want?
> 

Exactly; for block-based devices that is what is intended here. The "register block BDI" will
fill these in from there. It must be at the BDI level for those FSs that are not block
based but have similar alignment needs, and/or for filesystems that are multi-device,
like BTRFS and ZFS(FUSE), which have conglomerated alignment needs.
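
For example (purely illustrative, with hypothetical hint fields), a non-block,
multi-device filesystem could publish its conglomerated geometry when it
registers its BDI:

/* Hypothetical: an object-RAID layout, roughly as exofs would see it. */
struct ord_layout {
	unsigned int stripe_unit;	/* bytes written per device in turn   */
	unsigned int group_width;	/* number of data devices in a stripe */
};

struct bdi_io_hints {
	unsigned int io_align;		/* bytes */
	unsigned int io_opt;		/* bytes */
};

/* Called at mount time, once the geometry of all member devices is known. */
static void ord_fill_bdi_hints(const struct ord_layout *l,
			       struct bdi_io_hints *h)
{
	h->io_align = l->stripe_unit;
	h->io_opt   = l->stripe_unit * l->group_width;	/* one full stripe */
}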

>> [IO less sync]
>>
>>   This topic is actually related to the above Aligned IO. 
>>
>>   In today's code, in a regular write pattern, when an application is writing a long
>>   enough file, we have two sources of threads for the .write_pages vector. One is the
>>   BDI write_back thread, the other is the sync operation. This produces nightmarish IO
>>   patterns when the write_cache_pages() is re-entrant and each instance is fighting the
>>   other in garbing random pages, this is bad because of two reasons:
>>    1. makes each instance grab a none contiguous set of pages which causes the IO
>>       to split and be none-aligned.
>>    2. Causes Seeky IO where otherwise the application just wrote linear IO of
>>       a large file and then sync.
>>
>>   The IO pattern is so bad that in some cases it is better to serialize the call to
>>   write_cache_pages() to avoid it. Even with the cost of a Mutex at every call
>>
>>   Would it be hard to have "sync" set some info, raise a flag, fire up the writeback
>>   and wait for it to finish? writeback in it's turn should switch to a sync mode on that
>>   inode. (The sync operation need not change the writeback priority in my opinion like
>>   today)
> 
> This is essentially what we've been discussing in "Fixing Writeback" for
> the last two years, isn't it (the fact that we have multiple sources of
> writeback and they don't co-ordinate properly).  I thought our solution
> was to prefer linear over seeky ... 

Yes. Lots of work has been done, and as part of that a tremendous cleanup
has also been submitted, and the code is kind of ready for the next round.

Some of these things we've been talking about for years, as you said, but they are
not yet done. For example, my problem of seeky IO when the application
just gave us perfectly linear writeout. This is why I asked:
 Are we ready for the second round?

> adding a mutex makes that more
> absolute than a preference, but are you sure it helps (especially as it
> adds a lock to the writeout path).

No, I'm not sure at all. I just gave an example from one filesystem
(exofs, which I work on) where the penalty for non-aligned IO is so bad (RAID 5)
that a mutex at every IO gave better performance than the above problem. In
the end I did not submit this lock because it only helps the large-file
IO case, so for general workloads I could not prove whether it's better or
not.
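
The serialization I mean is roughly the following (a sketch only, not the code
exofs actually carries upstream; generic_writepages() stands in for the
filesystem's real write_cache_pages() based path):

#include <linux/fs.h>
#include <linux/writeback.h>
#include <linux/mutex.h>

/* One lock per superblock: only one writeback pass builds IO at a time. */
static DEFINE_MUTEX(fs_writepages_lock);

static int serialized_writepages(struct address_space *mapping,
				 struct writeback_control *wbc)
{
	int ret;

	/*
	 * Keep the BDI flusher and sync from interleaving their page
	 * scans, so each pass sees a contiguous, alignable run of pages.
	 */
	mutex_lock(&fs_writepages_lock);
	ret = generic_writepages(mapping, wbc);
	mutex_unlock(&fs_writepages_lock);
	return ret;
}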

> 
> James
> 
> 

Thanks
Boaz



* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Future writeback topics
  2012-01-22 15:37     ` Boaz Harrosh
@ 2012-01-22 15:49       ` James Bottomley
  -1 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2012-01-22 15:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Andrea Arcangeli, Wu Fengguang, Jan Kara, Martin K. Petersen,
	linux-scsi, Dave Chinner, linux-fsdevel, lsf-pc, linux-mm

[corrected linux-mm address I mistyped initially]
On Sun, 2012-01-22 at 17:37 +0200, Boaz Harrosh wrote:
> On 01/22/2012 04:49 PM, James Bottomley wrote:
> > Since a lot of these are mm related; added linux-mm to cc list
> > 
> 
> Hi James.
> 
> Thanks for reading, and sorry I missed linux-mm
> 
> > On Sun, 2012-01-22 at 15:50 +0200, Boaz Harrosh wrote:
> >> Hi
> >>
> >> Now that we have the "IO-less dirty throttling" in and kicking (ass I might say)
> >> Are there plans for second stage? I can see few areas that need some love.
> >>
> >> [IO Fairness, time sorted writeback, properly delayed writeback]
> >>
> >>   As we started to talk about in another thread: "[LSF/MM TOPIC] a few storage topics"
> >>   I would like to propose the following topics:
> >>
> >> * Do we have enough information for the time of dirty of pages, such as the
> >>   IO-elevators information, readily available to be used at the VFS layer.
> >> * BDI writeout should be smarter then a round robin cycle of SBs per BDI /
> >>   inodes. It should be time based, writing the oldest data first.
> >>   (Take the lowest indexed page of an inode as the dirty time of the inode.
> >>    maybe also keep an oldest modified inode per-SB of a BDI)
> >>
> >>   This can solve the IO fairness and latency bound (interactivness) of small
> >>   IOs.
> >>   There might be other solutions to this problem, any Ideas?
> >>
> >> * Introduce an "aging time" factor of an inode which will postpone the writeout
> >>   of an inode to the next writeback timer if the inode has "just changed".
> >>
> >>   This can solve the problem of an application doing heavy modification of some
> >>   area of a file and the writeback timer sampling that change too soon and forcing
> >>   pages to change during IO, as well as having split IO where waiting for the next
> >>   cycle could have the complete modification in a singe submit.
> >>
> >>
> >> [Targeted writeback (IO-less page-reclaim)]
> >>   Sometimes we would need to write a certain page or group of pages. It could be
> >>   nice to prioritize/start the writeback on these pages, through the regular writeback
> >>   mechanism instead of doing direct IO like today.
> >>
> >>   This is actually related to above where we can have a "write_now" time constant that
> >>   makes the priority of that inode to be written first. Then we also need the page-info
> >>   that we want to write as part of that inode's IO. Usually today we start at the lowest
> >>   indexed page of the inode, right? In targeted writeback we should make sure the writeout
> >>   is the longest contiguous (aligned) dirty region containing the targeted page.
> >>
> >>   With this in place we can also move to an IO-less page-reclaim. that is done entirely by
> >>   the BDI thread writeback. (Need I say more)
> > 
> > All of the above are complex.  The only reason for adding complexity in
> > our writeback path should be because we can demonstrate that it's
> > actually needed.  In order to demonstrate this, you'd need performance
> > measurements ... is there a plan to get these before the summit?
> > 
> 
> Some measurements have already been done and complained about. There were even attempts
> at IO-less page-reclaim by Dave Chinner if I recall correctly. Mainly the complains I'm
> addressing here are:
>  1. Very bad IO patterns of page-reclaim and it's avoidance.
>  2. The issue raised in that other thread about pages changing during IO penalty.
>  3. Oblivious-ness of the VFS writeback to fairness and the starvation of small IOs
>    in filesystems that are not block based.
> 
> But I agree much more testing is needed specially for 3. I can't promise I'll be up to it
> for LSF.

As long as someone does them, I don't really care who.

> Even more blasphemous of me is that I'm not the one that could code such changes,
> I'm not familiar and capable with the VFS code to do such a task. I only know that as a
> filesystem these are areas that are missed.

Well, OK, we'll treat this as a Call for a Topic rather than a topic
(depending on whether someone is willing to do the work and talk about
it) ... or we can just fold it into the general writeback discussion ...
I'm sure there'll be one of those.

> >> [Aligned IO]
> >>
> >>   Each BDI should have a way to specify it's Alignment preferences and optimum IO sizes
> >>   and the VFS writeout can take that into consideration when submitting IO.
> >>
> >>   This can both reduce lots of work done at individual filesystems, as well as benefit
> >>   lots of other filesystems that did not take care of this. It can also make the life of
> >>   some of the FSs that do care, a lot easier. Producing IO patterns that are much better
> >>   then what can be achieved today with the FS trying to second guess the VFS.
> > 
> > Since a bdi is coupled to a gendisk and a queue, why isn't
> > optimal_io_size what you want?
> > 
> 
> Exactly for block-based devices these are intended here. The "register block BDI" will
> fill these in from there. It must be at the BDI level for these FSs that are not block
> based but have similar alignment needs. And/or also filesystems that are multidevice
> like BTRFS and ZFS(Fuse) which have conglomerated alignment needs.

But this topic then becomes adding alignment for non-block-backed
filesystems?  I take it you're thinking of NFS rather than MTD or MMC?

For multiple devices, you do a simple cascade ... a bit like dm does
today ... but unless all the devices are aligned to optimal I/O it never
really works (and it's not necessarily worth solving ... the idea that
if you want performance from an array of devices, you match
characteristics isn't a hugely hard one to get the industry to swallow).

> >> [IO less sync]
> >>
> >>   This topic is actually related to the above Aligned IO. 
> >>
> >>   In today's code, in a regular write pattern, when an application is writing a long
> >>   enough file, we have two sources of threads for the .write_pages vector. One is the
> >>   BDI write_back thread, the other is the sync operation. This produces nightmarish IO
> >>   patterns when the write_cache_pages() is re-entrant and each instance is fighting the
> >>   other in garbing random pages, this is bad because of two reasons:
> >>    1. makes each instance grab a none contiguous set of pages which causes the IO
> >>       to split and be none-aligned.
> >>    2. Causes Seeky IO where otherwise the application just wrote linear IO of
> >>       a large file and then sync.
> >>
> >>   The IO pattern is so bad that in some cases it is better to serialize the call to
> >>   write_cache_pages() to avoid it. Even with the cost of a Mutex at every call
> >>
> >>   Would it be hard to have "sync" set some info, raise a flag, fire up the writeback
> >>   and wait for it to finish? writeback in it's turn should switch to a sync mode on that
> >>   inode. (The sync operation need not change the writeback priority in my opinion like
> >>   today)
> > 
> > This is essentially what we've been discussing in "Fixing Writeback" for
> > the last two years, isn't it (the fact that we have multiple sources of
> > writeback and they don't co-ordinate properly).  I thought our solution
> > was to prefer linear over seeky ... 
> 
> Yes. Lots of work has been done, and as part of that a tremendous clean up
> has also been submitted and the code is kind of ready for the next round.
> 
> Some of these things we've been talking about for years as you said but are
> not yet done. For example my problem of seeky IO when the application
> just gave us perfectly linear writeout. This is why I said:
>  Are we ready for the second round?

OK, will defer to mm guys.

> > adding a mutex makes that more
> > absolute than a preference, but are you sure it helps (especially as it
> > adds a lock to the writeout path).
> 
> No, I'm not sure at all. I just gave an example at some example filesystem
> (exofs that I work on) where the penalty for non aligned IO is so bad (Raid 5)
> that a Mutex at every IO gave better performance then the above problem. I have
> not submitted this lock at the end because it is only for the large-file
> IO case, so in the General workloads I could not prove if it's better or
> not.

Global mutexes add latency to the fast path ... this latency rises
with the NUMA-ness or number of cores on the system ... that's why it
hit my "are you really sure" detector.

James




* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Future writeback topics
@ 2012-01-22 15:49       ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2012-01-22 15:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Andrea Arcangeli, Wu Fengguang, Jan Kara, Martin K. Petersen,
	linux-scsi, Dave Chinner, linux-fsdevel, lsf-pc, linux-mm

[corrected linux-mm address I mistyped initially]
On Sun, 2012-01-22 at 17:37 +0200, Boaz Harrosh wrote:
> On 01/22/2012 04:49 PM, James Bottomley wrote:
> > Since a lot of these are mm related; added linux-mm to cc list
> > 
> 
> Hi James.
> 
> Thanks for reading, and sorry I missed linux-mm
> 
> > On Sun, 2012-01-22 at 15:50 +0200, Boaz Harrosh wrote:
> >> Hi
> >>
> >> Now that we have the "IO-less dirty throttling" in and kicking (ass I might say)
> >> Are there plans for second stage? I can see few areas that need some love.
> >>
> >> [IO Fairness, time sorted writeback, properly delayed writeback]
> >>
> >>   As we started to talk about in another thread: "[LSF/MM TOPIC] a few storage topics"
> >>   I would like to propose the following topics:
> >>
> >> * Do we have enough information for the time of dirty of pages, such as the
> >>   IO-elevators information, readily available to be used at the VFS layer.
> >> * BDI writeout should be smarter then a round robin cycle of SBs per BDI /
> >>   inodes. It should be time based, writing the oldest data first.
> >>   (Take the lowest indexed page of an inode as the dirty time of the inode.
> >>    maybe also keep an oldest modified inode per-SB of a BDI)
> >>
> >>   This can solve the IO fairness and latency bound (interactivness) of small
> >>   IOs.
> >>   There might be other solutions to this problem, any Ideas?
> >>
> >> * Introduce an "aging time" factor of an inode which will postpone the writeout
> >>   of an inode to the next writeback timer if the inode has "just changed".
> >>
> >>   This can solve the problem of an application doing heavy modification of some
> >>   area of a file and the writeback timer sampling that change too soon and forcing
> >>   pages to change during IO, as well as having split IO where waiting for the next
> >>   cycle could have the complete modification in a singe submit.
> >>
> >>
> >> [Targeted writeback (IO-less page-reclaim)]
> >>   Sometimes we would need to write a certain page or group of pages. It could be
> >>   nice to prioritize/start the writeback on these pages, through the regular writeback
> >>   mechanism instead of doing direct IO like today.
> >>
> >>   This is actually related to above where we can have a "write_now" time constant that
> >>   makes the priority of that inode to be written first. Then we also need the page-info
> >>   that we want to write as part of that inode's IO. Usually today we start at the lowest
> >>   indexed page of the inode, right? In targeted writeback we should make sure the writeout
> >>   is the longest contiguous (aligned) dirty region containing the targeted page.
> >>
> >>   With this in place we can also move to an IO-less page-reclaim. that is done entirely by
> >>   the BDI thread writeback. (Need I say more)
> > 
> > All of the above are complex.  The only reason for adding complexity in
> > our writeback path should be because we can demonstrate that it's
> > actually needed.  In order to demonstrate this, you'd need performance
> > measurements ... is there a plan to get these before the summit?
> > 
> 
> Some measurements have already been done and complained about. There were even attempts
> at IO-less page-reclaim by Dave Chinner if I recall correctly. Mainly the complains I'm
> addressing here are:
>  1. Very bad IO patterns of page-reclaim and it's avoidance.
>  2. The issue raised in that other thread about pages changing during IO penalty.
>  3. Oblivious-ness of the VFS writeback to fairness and the starvation of small IOs
>    in filesystems that are not block based.
> 
> But I agree much more testing is needed specially for 3. I can't promise I'll be up to it
> for LSF.

As long as someone does them, I don't really care who.

> Even more blasphemous of me is that I'm not the one that could code such changes,
> I'm not familiar and capable with the VFS code to do such a task. I only know that as a
> filesystem these are areas that are missed.

Well, OK, we'll treat this as a Call for a Topic rather than a topic
(depending on whether someone is willing to do the work and talk about
it) ... or we can just fold it into the general writeback discussion ...
I'm sure there'll be one of those.

> >> [Aligned IO]
> >>
> >>   Each BDI should have a way to specify it's Alignment preferences and optimum IO sizes
> >>   and the VFS writeout can take that into consideration when submitting IO.
> >>
> >>   This can both reduce lots of work done at individual filesystems, as well as benefit
> >>   lots of other filesystems that did not take care of this. It can also make the life of
> >>   some of the FSs that do care, a lot easier. Producing IO patterns that are much better
> >>   then what can be achieved today with the FS trying to second guess the VFS.
> > 
> > Since a bdi is coupled to a gendisk and a queue, why isn't
> > optimal_io_size what you want?
> > 
> 
> Exactly for block-based devices these are intended here. The "register block BDI" will
> fill these in from there. It must be at the BDI level for these FSs that are not block
> based but have similar alignment needs. And/or also filesystems that are multidevice
> like BTRFS and ZFS(Fuse) which have conglomerated alignment needs.

But this topic then becomes adding alignment for non block backed
filesystems?  I take it you're thinking NFS rather than MTD or MMC?

For multiple devices, you do a simple cascade ... a bit like dm does
today ... but unless all the devices are aligned to optimal I/O it never
really works (and it's not necessarily worth solving ... the idea that
if you want performance from an array of devices, you match
characteristics isn't a hugely hard one to get the industry to swallow).

> >> [IO less sync]
> >>
> >>   This topic is actually related to the above Aligned IO. 
> >>
> >>   In today's code, in a regular write pattern, when an application is writing a long
> >>   enough file, we have two sources of threads for the .write_pages vector. One is the
> >>   BDI write_back thread, the other is the sync operation. This produces nightmarish IO
> >>   patterns when the write_cache_pages() is re-entrant and each instance is fighting the
> >>   other in garbing random pages, this is bad because of two reasons:
> >>    1. makes each instance grab a none contiguous set of pages which causes the IO
> >>       to split and be none-aligned.
> >>    2. Causes Seeky IO where otherwise the application just wrote linear IO of
> >>       a large file and then sync.
> >>
> >>   The IO pattern is so bad that in some cases it is better to serialize the call to
> >>   write_cache_pages() to avoid it. Even with the cost of a Mutex at every call
> >>
> >>   Would it be hard to have "sync" set some info, raise a flag, fire up the writeback
> >>   and wait for it to finish? Writeback, in its turn, should switch to a sync mode on that
> >>   inode. (The sync operation need not change the writeback priority in my opinion like
> >>   today)
> > 
> > This is essentially what we've been discussing in "Fixing Writeback" for
> > the last two years, isn't it (the fact that we have multiple sources of
> > writeback and they don't co-ordinate properly).  I thought our solution
> > was to prefer linear over seeky ... 
> 
> Yes. Lots of work has been done, and as part of that a tremendous clean up
> has also been submitted and the code is kind of ready for the next round.
> 
> Some of these things we've been talking about for years as you said but are
> not yet done. For example my problem of seeky IO when the application
> just gave us perfectly linear writeout. This is why I said:
>  Are we ready for the second round?

OK, will defer to mm guys.

> > adding a mutex makes that more
> > absolute than a preference, but are you sure it helps (especially as it
> > adds a lock to the writeout path).
> 
> No, I'm not sure at all. I just gave an example from one filesystem
> (exofs, which I work on) where the penalty for non-aligned IO is so bad (RAID-5)
> that a mutex at every IO gave better performance than the above problem. In the
> end I did not submit this lock, because it only helps the large-file
> IO case, and for general workloads I could not prove whether it's better or
> not.
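
(For reference, that workaround has roughly the following shape - illustrative only,
with made-up names rather than the actual exofs code:)

#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/writeback.h>

/* Hypothetical per-filesystem lock serializing all writeback passes. */
static DEFINE_MUTEX(myfs_writeback_mutex);

static int myfs_writepage_cb(struct page *page,
                             struct writeback_control *wbc, void *data);

static int myfs_writepages(struct address_space *mapping,
                           struct writeback_control *wbc)
{
        int ret;

        /* One write_cache_pages() pass at a time, so each pass walks a
         * contiguous run of dirty pages instead of racing another
         * instance for them. */
        mutex_lock(&myfs_writeback_mutex);
        ret = write_cache_pages(mapping, wbc, myfs_writepage_cb, mapping);
        mutex_unlock(&myfs_writeback_mutex);
        return ret;
}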

Global mutexes add latency to the fast path ... this latency rises
with the NUMA-ness or the number of cores on the system ... that's why it
hit my "are you really sure" detector.

James


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Future writeback topics
  2012-01-22 15:49       ` James Bottomley
@ 2012-01-22 22:11         ` Boaz Harrosh
  -1 siblings, 0 replies; 17+ messages in thread
From: Boaz Harrosh @ 2012-01-22 22:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andrea Arcangeli, Wu Fengguang, Jan Kara, Martin K. Petersen,
	linux-scsi, Dave Chinner, linux-fsdevel, lsf-pc, linux-mm

On 01/22/2012 05:49 PM, James Bottomley wrote:
> 
> But this topic then becomes adding alignment for non block backed
> filesystems?  I take it you're thinking NFS rather than MTD or MMC?
> 

Sorry to differ, but no: this is mostly about making the IO aligned in the first
place, block-dev or not. Today the VFS has no notion of alignment, and IO is
submitted as is, without any alignment considerations.
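
To make it concrete, what I have in mind is along these lines. The two fields below are
hypothetical (they do not exist in struct backing_dev_info today); for a block-backed BDI
they would simply be copied from the queue limits at registration time:

#include <linux/backing-dev.h>
#include <linux/blkdev.h>

/*
 * Hypothetical BDI-level hints -- NOT existing fields.  A block-backed
 * BDI would fill them from its request queue; non-block filesystems
 * (NFS, exofs, ...) would set them to whatever their backend needs.
 */
struct bdi_io_hints {
        unsigned int io_align;  /* preferred alignment, in bytes */
        unsigned int io_opt;    /* optimal IO size, in bytes */
};

static void bdi_hints_from_queue(struct bdi_io_hints *h,
                                 struct request_queue *q)
{
        h->io_align = queue_io_min(q);  /* e.g. RAID chunk size */
        h->io_opt   = queue_io_opt(q);  /* e.g. full stripe width */
}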

> For multiple devices, you do a simple cascade ... a bit like dm does
> today ... but unless all the devices are aligned to optimal I/O it never
> really works (and it's not necessarily worth solving ... the idea that
> if you want performance from an array of devices, you match
> characteristics isn't a hugely hard one to get the industry to swallow).
> 

No, I'm talking about RAID configurations like object RAID in exofs/NFS or
RAID-0/5 in BTRFS and ZFS and such, where there are other, larger alignment
structures to consider. It also matters for large-block filesystems/devices that
would like IO aligned on sizes bigger than a page.
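
For example, with such a hint in place the writeout path could trim the page range it
submits to whole stripes, roughly like the sketch below. This is only an illustration;
"stripe_bytes" stands for whatever the BDI advertises, and the caller would still have
to verify that the widened range is actually dirty and contiguous:

#include <linux/kernel.h>
#include <linux/mm.h>

/*
 * Sketch: widen a dirty page range [start, end) to whole-stripe
 * boundaries so a RAID-5 style backend never sees a partial-stripe
 * write (which would force a read-modify-write).
 */
static void align_range_to_stripe(pgoff_t *start, pgoff_t *end,
                                  unsigned int stripe_bytes)
{
        pgoff_t pages_per_stripe = stripe_bytes >> PAGE_SHIFT;

        if (!pages_per_stripe)
                return;
        *start = rounddown(*start, pages_per_stripe);
        *end   = roundup(*end, pages_per_stripe);
}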

Thanks
Boaz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [LSF/MM TOPIC] [ATTEND] Future writeback topics
  2012-01-22 15:27   ` James Bottomley
@ 2012-01-23 12:33     ` Johannes Weiner
  -1 siblings, 0 replies; 17+ messages in thread
From: Johannes Weiner @ 2012-01-23 12:33 UTC (permalink / raw)
  To: James Bottomley
  Cc: Boaz Harrosh, lsf-pc, linux-scsi, linux-fsdevel, Jan Kara,
	Andrea Arcangeli, Wu Fengguang, Martin K. Petersen, Dave Chinner,
	linux-mm

On Sun, Jan 22, 2012 at 09:27:14AM -0600, James Bottomley wrote:
> Since a lot of these are mm related; added linux-mm to cc list
> 
> On Sun, 2012-01-22 at 15:50 +0200, Boaz Harrosh wrote:
> > [Targeted writeback (IO-less page-reclaim)]
> >   Sometimes we would need to write a certain page or group of pages. It could be
> >   nice to prioritize/start the writeback on these pages, through the regular writeback
> >   mechanism instead of doing direct IO like today.
> > 
> >   This is actually related to above where we can have a "write_now" time constant that
> >   makes the priority of that inode to be written first. Then we also need the page-info
> >   that we want to write as part of that inode's IO. Usually today we start at the lowest
> >   indexed page of the inode, right? In targeted writeback we should make sure the writeout
> >   is the longest contiguous (aligned) dirty region containing the targeted page.
> > 
> >   With this in place we can also move to an IO-less page-reclaim. that is done entirely by
> >   the BDI thread writeback. (Need I say more)
>
> All of the above are complex.  The only reason for adding complexity in
> our writeback path should be because we can demonstrate that it's
> actually needed.  In order to demonstrate this, you'd need performance
> measurements ... is there a plan to get these before the summit?

The situations that required writeback for reclaim to make progress
have shrunk a lot with this merge window because of respecting page
reserves in the dirty limits, and per-zone dirty limits.

What's left to evaluate are certain NUMA configurations where the
dirty pages are concentrated on a few nodes.  Currently, we kick the
flushers from direct reclaim, completely undirected, just "clean some
pages, please".  That works for systems up to a certain size,
depending on the size of the node in relationship to the system as a
whole (likelihood of pages cleaned being from the target node) and how
fast the backing storage is (impact of cleaning 'wrong' pages).
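
(Purely to illustrate what "directed" would have to mean - no such interface exists, and
the hard part is not the wakeup itself but pushing the node all the way down into the
flusher's inode and page selection:)

/*
 * Hypothetical, for illustration only.  Today direct reclaim can only
 * ask the flushers for "some pages, somewhere"; a node-aware variant
 * would need to say where the pages it is waiting for actually live,
 * e.g. something like wakeup_flusher_threads_node(zone_to_nid(zone), nr).
 */
void wakeup_flusher_threads_node(int nid, long nr_pages);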

So while the original problem is still standing, the urgency of it
might have been reduced quite a bit or the problem itself might have
been pushed into a corner where workarounds (spread dirty data more
evenly e.g.) might be more economical than trying to make writeback
node-aware and deal with all the implications (still have to guarantee
dirty cache expiration times for integrity; can fail spectacularly
when there is little or no relationship between disk placement and
memory placement, imagine round-robin allocation of disk-contiguous
dirty cache over a few nodes).

I agree with James: find scenarios where workarounds are not feasible
but that are important enough that the complexity would be justified.
Otherwise, talking about how to fix them is moot.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [LSF/MM TOPIC] [ATTEND] Future writeback topics
  2012-01-23 12:33     ` Johannes Weiner
@ 2012-01-23 13:41       ` Boaz Harrosh
  -1 siblings, 0 replies; 17+ messages in thread
From: Boaz Harrosh @ 2012-01-23 13:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: James Bottomley, lsf-pc, linux-scsi, linux-fsdevel, Jan Kara,
	Andrea Arcangeli, Wu Fengguang, Martin K. Petersen, Dave Chinner,
	linux-mm

On 01/23/2012 02:33 PM, Johannes Weiner wrote:
> On Sun, Jan 22, 2012 at 09:27:14AM -0600, James Bottomley wrote:
>> Since a lot of these are mm related; added linux-mm to cc list
>>
>> On Sun, 2012-01-22 at 15:50 +0200, Boaz Harrosh wrote:
>>> [Targeted writeback (IO-less page-reclaim)]
>>>   Sometimes we would need to write a certain page or group of pages. It could be
>>>   nice to prioritize/start the writeback on these pages, through the regular writeback
>>>   mechanism instead of doing direct IO like today.
>>>
>>>   This is actually related to above where we can have a "write_now" time constant that
>>>   makes the priority of that inode to be written first. Then we also need the page-info
>>>   that we want to write as part of that inode's IO. Usually today we start at the lowest
>>>   indexed page of the inode, right? In targeted writeback we should make sure the writeout
>>>   is the longest contiguous (aligned) dirty region containing the targeted page.
>>>
>>>   With this in place we can also move to an IO-less page-reclaim. that is done entirely by
>>>   the BDI thread writeback. (Need I say more)
>>
>> All of the above are complex.  The only reason for adding complexity in
>> our writeback path should be because we can demonstrate that it's
>> actually needed.  In order to demonstrate this, you'd need performance
>> measurements ... is there a plan to get these before the summit?
> 
> The situations that required writeback for reclaim to make progress
> have shrunk a lot with this merge window because of respecting page
> reserves in the dirty limits, and per-zone dirty limits.
> 
> What's left to evaluate are certain NUMA configurations where the
> dirty pages are concentrated on a few nodes.  Currently, we kick the
> flushers from direct reclaim, completely undirected, just "clean some
> pages, please".  That works for systems up to a certain size,
> depending on the size of the node in relationship to the system as a
> whole (likelihood of pages cleaned being from the target node) and how
> fast the backing storage is (impact of cleaning 'wrong' pages).
> 
> So while the original problem is still standing, the urgency of it
> might have been reduced quite a bit or the problem itself might have
> been pushed into a corner where workarounds (spread dirty data more
> evenly e.g.) might be more economical than trying to make writeback
> node-aware and deal with all the implications (still have to guarantee
> dirty cache expiration times for integrity; can fail spectacularly
> when there is little or no relationship between disk placement and
> memory placement, imagine round-robin allocation of disk-contiguous
> dirty cache over a few nodes).
> 
> I agree with James: find scenarios where workarounds are not feasible
> but that are important enough that the complexity would be justified.
> Otherwise, talking about how to fix them is moot.

Fine, so IO-less page-reclaim is moot. What do I know, I've never seen
a NUMA machine. But that was just a by-product of half a section
out of a list of 8 sections. Are all of these moot? I must be smoking something
good ;-)

Thanks
Boaz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [LSF/MM TOPIC] [ATTEND] Future writeback topics
  2012-01-22 13:50 ` Boaz Harrosh
                   ` (2 preceding siblings ...)
  (?)
@ 2012-01-23 18:15 ` Jan Kara
  -1 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2012-01-23 18:15 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: lsf-pc, linux-scsi, linux-fsdevel, Jan Kara, Andrea Arcangeli,
	Wu Fengguang, Martin K. Petersen, Dave Chinner

On Sun 22-01-12 15:50:20, Boaz Harrosh wrote:
> Now that we have the "IO-less dirty throttling" in and kicking (ass I might say)
> Are there plans for second stage? I can see few areas that need some love.
> 
> [IO Fairness, time sorted writeback, properly delayed writeback]
> 
>   As we started to talk about in another thread: "[LSF/MM TOPIC] a few storage topics"
>   I would like to propose the following topics:
> 
> * Do we have enough information for the time of dirty of pages, such as the
>   IO-elevators information, readily available to be used at the VFS layer.
> * BDI writeout should be smarter then a round robin cycle of SBs per BDI /
>   inodes. It should be time based, writing the oldest data first.
>   (Take the lowest indexed page of an inode as the dirty time of the inode.
>    maybe also keep an oldest modified inode per-SB of a BDI)
  As I wrote in the other thread, we are a bit smarter by using the per-inode
dirtied_when timestamp, but not much. It's hard to do better without
introducing a rather big memory cost (e.g. something like the per-page timestamps
you suggest). So if you have a solution without a big overhead, I'm happy to
hear about it.
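
  To make the granularity point concrete: the only timing information writeback has
today is one timestamp per inode, taken when the inode first became dirty, so "oldest
data first" can only mean something like the check below. Per-page timestamps would be
more precise but cost on the order of sizeof(unsigned long) per page - 8 bytes per
4 KiB page, roughly 0.2% of RAM - which is the overhead I mean:

#include <linux/fs.h>
#include <linux/jiffies.h>

/* Sketch: expire an inode for writeback once its oldest dirty data
 * is older than max_age_j jiffies.  The per-inode dirtied_when field
 * is what exists today; there is no per-page equivalent. */
static bool inode_writeback_expired(struct inode *inode,
                                    unsigned long max_age_j)
{
        return time_after(jiffies, inode->dirtied_when + max_age_j);
}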

>   This can solve the IO fairness and latency bound (interactivness) of small
>   IOs.
  As I also said in the other thread, writeback IMO isn't the right place to
solve problems of small vs. big IO. Writeback should more or less guarantee
that data gets to disk before a certain time, to assure reasonable behavior
after a crash. We also try to be fair among files, but that's basically our
way of getting data to disk early enough. I don't know of any other
fairness that would make sense to handle in the writeback code.

>   There might be other solutions to this problem, any Ideas?
> 
> * Introduce an "aging time" factor of an inode which will postpone the writeout
>   of an inode to the next writeback timer if the inode has "just changed".
> 
>   This can solve the problem of an application doing heavy modification of some
>   area of a file and the writeback timer sampling that change too soon and forcing
>   pages to change during IO, as well as having split IO where waiting for the next
>   cycle could have the complete modification in a singe submit.
  But it also brings some problems - like avoiding postponing writeback
forever. The devil is in the details here, I believe; I thought about
similar ideas some time ago and didn't come up with anything reasonably
simple that worked better than the current simple scheme.
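
  In sketch form, the kind of scheme being proposed - and the cap that keeps it from
postponing writeback forever, which is where the details bite - would look roughly like
this. Both timestamps and both knobs are hypothetical, and the second timestamp would
have to be updated on every re-dirty, which is itself a cost:

#include <linux/jiffies.h>
#include <linux/types.h>

/*
 * Hypothetical: postpone writing an inode that is still being actively
 * modified, but never past a hard expiry deadline.
 */
static bool postpone_writeback(unsigned long first_dirtied, /* like dirtied_when */
                               unsigned long last_dirtied,  /* hypothetical field */
                               unsigned long settle_j,      /* "just changed" window */
                               unsigned long max_age_j)     /* integrity deadline */
{
        /* Never postpone past the data-integrity deadline ... */
        if (time_after(jiffies, first_dirtied + max_age_j))
                return false;
        /* ... otherwise wait for the recent modifications to settle. */
        return time_before(jiffies, last_dirtied + settle_j);
}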

> [Targeted writeback (IO-less page-reclaim)]
>   Sometimes we would need to write a certain page or group of pages. It could be
>   nice to prioritize/start the writeback on these pages, through the regular writeback
>   mechanism instead of doing direct IO like today.
> 
>   This is actually related to above where we can have a "write_now" time constant that
>   makes the priority of that inode to be written first. Then we also need the page-info
>   that we want to write as part of that inode's IO. Usually today we start at the lowest
>   indexed page of the inode, right? In targeted writeback we should make sure the writeout
>   is the longest contiguous (aligned) dirty region containing the targeted page.
> 
>   With this in place we can also move to an IO-less page-reclaim. that is done entirely by
>   the BDI thread writeback. (Need I say more)
  Again, expensive to track IMHO. Also as Johannes wrote, IO-less
page-reclaim may be less urgent in recent kernels.
 
> [Aligned IO]
> 
>   Each BDI should have a way to specify it's Alignment preferences and optimum IO sizes
>   and the VFS writeout can take that into consideration when submitting IO.
> 
>   This can both reduce lots of work done at individual filesystems, as well as benefit
>   lots of other filesystems that did not take care of this. It can also make the life of
>   some of the FSs that do care, a lot easier. Producing IO patterns that are much better
>   then what can be achieved today with the FS trying to second guess the VFS.
  This is probably doable and may be reasonable. It's just that currently the writeback
code has no idea where a particular page lands on disk (mapping logical
offset -> physical block is in the filesystem's hands). Still, it looks
feasible; someone just has to write the code to expose enough
information from the filesystems to writeback.
 
> [IO less sync]
> 
>   This topic is actually related to the above Aligned IO. 
> 
>   In today's code, in a regular write pattern, when an application is writing a long
>   enough file, we have two sources of threads for the .write_pages vector. One is the
>   BDI write_back thread, the other is the sync operation. This produces nightmarish IO
>   patterns when the write_cache_pages() is re-entrant and each instance is fighting the
>   other in grabbing random pages, this is bad because of two reasons:
>    1. makes each instance grab a non-contiguous set of pages which causes the IO
>       to split and be non-aligned.
>    2. Causes Seeky IO where otherwise the application just wrote linear IO of
>       a large file and then sync.
> 
>   The IO pattern is so bad that in some cases it is better to serialize the call to
>   write_cache_pages() to avoid it. Even with the cost of a Mutex at every call
> 
>   Would it be hard to have "sync" set some info, raise a flag, fire up the writeback
>   and wait for it to finish? Writeback, in its turn, should switch to a sync mode on that
>   inode. (The sync operation need not change the writeback priority in my opinion like
>   today)
  We already have the I_SYNC inode flag for this, used in
writeback_single_inode(). It's just that the fsync path currently seems to avoid
writeback_single_inode(), so that exclusion doesn't quite work. I'm not
sure if I_SYNC was originally intended to provide exclusion against fsync,
but in either case I believe this particular problem can somehow be
resolved.
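
  In sketch form, the exclusion that exists today (and that the fsync path bypasses)
amounts to the following; the real code sleeps on the __I_SYNC bit waitqueue rather
than spinning, so treat this strictly as an illustration of the idea:

#include <linux/fs.h>
#include <linux/spinlock.h>

/*
 * Simplified illustration: the flusher sets I_SYNC around its
 * single-inode writeback.  If fsync claimed the same flag instead of
 * calling ->writepages directly, the two writers could no longer
 * interleave their passes over one inode.
 */
static void claim_inode_sync(struct inode *inode)
{
        spin_lock(&inode->i_lock);
        while (inode->i_state & I_SYNC) {
                spin_unlock(&inode->i_lock);
                cpu_relax();    /* the real code sleeps, not spins */
                spin_lock(&inode->i_lock);
        }
        inode->i_state |= I_SYNC;
        spin_unlock(&inode->i_lock);
}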

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] [ATTEND] Future writeback topics
  2012-01-22 13:50 ` Boaz Harrosh
                   ` (3 preceding siblings ...)
  (?)
@ 2012-01-23 20:19 ` Vivek Goyal
  -1 siblings, 0 replies; 17+ messages in thread
From: Vivek Goyal @ 2012-01-23 20:19 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: lsf-pc, linux-scsi, linux-fsdevel, Andrea Arcangeli,
	Dave Chinner, Wu Fengguang, Jan Kara, Martin K. Petersen

On Sun, Jan 22, 2012 at 03:50:20PM +0200, Boaz Harrosh wrote:
> Hi
> 
> Now that we have the "IO-less dirty throttling" in and kicking (ass I might say)
> Are there plans for second stage? I can see few areas that need some love.
> 
> [IO Fairness, time sorted writeback, properly delayed writeback]
> 
>   As we started to talk about in another thread: "[LSF/MM TOPIC] a few storage topics"
>   I would like to propose the following topics:
> 
> * Do we have enough information for the time of dirty of pages, such as the
>   IO-elevators information, readily available to be used at the VFS layer.

Assuming it is available, what's the plan? How would the VFS layer make use
of it?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 17+ messages in thread
