All of lore.kernel.org
 help / color / mirror / Atom feed
* Latency writing to an mlocked ext4 mapping
@ 2011-10-20  0:39 ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  0:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-ext4

I have a real-time program that has everything mlocked (i.e.
mlockall(MCL_CURRENT | MCL_FUTURE)).  It has some log files opened for
writing.  Those files are opened and memset to zero in another thread
to fault everything in.  The system is under light I/O load with very
little memory pressure.

Latencytop shows frequent latency in the real-time threads.  The main
offenders are:

schedule sleep_on_page wait_on_page_bit ext4_page_mkwrite do_wp_page
handle_pte_fault handle_mm_fault do_page_fault page_fault

schedule do_get_write_access jbd2_journal_get_write_access
__ext4_journal_get_write_access ext4_reserve_inode_write
ext4_mark_inode_dirty ext4_dirty_inode __mark_inode_dirty
file_update_time do_wp_page handle_pte_fault handle_mm_fault


I imagine the problem is that the system is periodically writing out
my dirty pages and marking them clean (and hence write protected).
When I try to write to them, the kernel makes them writable again,
which causes latency either due to updating the inode mtime or because
the file is being written to disk when I try to write to it.

Is there any way to prevent this?  One possibility would be a way to
ask the kernel not to write the file out to disk.  Another would be a
way to ask the kernel to make a copy of the file when it writes it
disk and leave the original mapping writable.

Obviously I can fix this by mapping anonymous memory, but then I need
another thread to periodically write my logs out to disk, and if that
crashes, I lose data.

-- 
Andy Lutomirski
AMA Capital Management, LLC
Office: (310) 553-5322
Mobile: (650) 906-0647

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Latency writing to an mlocked ext4 mapping
@ 2011-10-20  0:39 ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  0:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-ext4

I have a real-time program that has everything mlocked (i.e.
mlockall(MCL_CURRENT | MCL_FUTURE)).  It has some log files opened for
writing.  Those files are opened and memset to zero in another thread
to fault everything in.  The system is under light I/O load with very
little memory pressure.

Latencytop shows frequent latency in the real-time threads.  The main
offenders are:

schedule sleep_on_page wait_on_page_bit ext4_page_mkwrite do_wp_page
handle_pte_fault handle_mm_fault do_page_fault page_fault

schedule do_get_write_access jbd2_journal_get_write_access
__ext4_journal_get_write_access ext4_reserve_inode_write
ext4_mark_inode_dirty ext4_dirty_inode __mark_inode_dirty
file_update_time do_wp_page handle_pte_fault handle_mm_fault


I imagine the problem is that the system is periodically writing out
my dirty pages and marking them clean (and hence write protected).
When I try to write to them, the kernel makes them writable again,
which causes latency either due to updating the inode mtime or because
the file is being written to disk when I try to write to it.

Is there any way to prevent this?  One possibility would be a way to
ask the kernel not to write the file out to disk.  Another would be a
way to ask the kernel to make a copy of the file when it writes it
disk and leave the original mapping writable.

Obviously I can fix this by mapping anonymous memory, but then I need
another thread to periodically write my logs out to disk, and if that
crashes, I lose data.

-- 
Andy Lutomirski
AMA Capital Management, LLC
Office: (310) 553-5322
Mobile: (650) 906-0647

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-20  0:39 ` Andy Lutomirski
@ 2011-10-20  1:02   ` Andreas Dilger
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Dilger @ 2011-10-20  1:02 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, linux-mm, linux-ext4

What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data. 

We discussed doing copy-on-write, but there are relatively few mmap users and it wasn't clear whether the complexity was worth it. 

Cheers, Andreas

On 2011-10-19, at 6:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:

> I have a real-time program that has everything mlocked (i.e.
> mlockall(MCL_CURRENT | MCL_FUTURE)).  It has some log files opened for
> writing.  Those files are opened and memset to zero in another thread
> to fault everything in.  The system is under light I/O load with very
> little memory pressure.
> 
> Latencytop shows frequent latency in the real-time threads.  The main
> offenders are:
> 
> schedule sleep_on_page wait_on_page_bit ext4_page_mkwrite do_wp_page
> handle_pte_fault handle_mm_fault do_page_fault page_fault
> 
> schedule do_get_write_access jbd2_journal_get_write_access
> __ext4_journal_get_write_access ext4_reserve_inode_write
> ext4_mark_inode_dirty ext4_dirty_inode __mark_inode_dirty
> file_update_time do_wp_page handle_pte_fault handle_mm_fault
> 
> 
> I imagine the problem is that the system is periodically writing out
> my dirty pages and marking them clean (and hence write protected).
> When I try to write to them, the kernel makes them writable again,
> which causes latency either due to updating the inode mtime or because
> the file is being written to disk when I try to write to it.
> 
> Is there any way to prevent this?  One possibility would be a way to
> ask the kernel not to write the file out to disk.  Another would be a
> way to ask the kernel to make a copy of the file when it writes it
> disk and leave the original mapping writable.
> 
> Obviously I can fix this by mapping anonymous memory, but then I need
> another thread to periodically write my logs out to disk, and if that
> crashes, I lose data.
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> Office: (310) 553-5322
> Mobile: (650) 906-0647
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-20  1:02   ` Andreas Dilger
  0 siblings, 0 replies; 49+ messages in thread
From: Andreas Dilger @ 2011-10-20  1:02 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, linux-mm, linux-ext4

What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data. 

We discussed doing copy-on-write, but there are relatively few mmap users and it wasn't clear whether the complexity was worth it. 

Cheers, Andreas

On 2011-10-19, at 6:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:

> I have a real-time program that has everything mlocked (i.e.
> mlockall(MCL_CURRENT | MCL_FUTURE)).  It has some log files opened for
> writing.  Those files are opened and memset to zero in another thread
> to fault everything in.  The system is under light I/O load with very
> little memory pressure.
> 
> Latencytop shows frequent latency in the real-time threads.  The main
> offenders are:
> 
> schedule sleep_on_page wait_on_page_bit ext4_page_mkwrite do_wp_page
> handle_pte_fault handle_mm_fault do_page_fault page_fault
> 
> schedule do_get_write_access jbd2_journal_get_write_access
> __ext4_journal_get_write_access ext4_reserve_inode_write
> ext4_mark_inode_dirty ext4_dirty_inode __mark_inode_dirty
> file_update_time do_wp_page handle_pte_fault handle_mm_fault
> 
> 
> I imagine the problem is that the system is periodically writing out
> my dirty pages and marking them clean (and hence write protected).
> When I try to write to them, the kernel makes them writable again,
> which causes latency either due to updating the inode mtime or because
> the file is being written to disk when I try to write to it.
> 
> Is there any way to prevent this?  One possibility would be a way to
> ask the kernel not to write the file out to disk.  Another would be a
> way to ask the kernel to make a copy of the file when it writes it
> disk and leave the original mapping writable.
> 
> Obviously I can fix this by mapping anonymous memory, but then I need
> another thread to periodically write my logs out to disk, and if that
> crashes, I lose data.
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> Office: (310) 553-5322
> Mobile: (650) 906-0647
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-20  1:02   ` Andreas Dilger
@ 2011-10-20  1:15     ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  1:15 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-mm, linux-ext4

On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.

3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
code that I'll submit sometime soon.)  Would this explain the latency
in file_update_time or is that a separate issue?  file_update_time
seems like a good thing to make fully asynchronous (especially if the
file in question is a fifo, but I've already moved my fifos to tmpfs).

>
> We discussed doing copy-on-write, but there are relatively few mmap users and it wasn't clear whether the complexity was worth it.

Hmm.  That might be nice, especially if the page is mlocked.

--Andy

>
> Cheers, Andreas
>
> On 2011-10-19, at 6:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>> I have a real-time program that has everything mlocked (i.e.
>> mlockall(MCL_CURRENT | MCL_FUTURE)).  It has some log files opened for
>> writing.  Those files are opened and memset to zero in another thread
>> to fault everything in.  The system is under light I/O load with very
>> little memory pressure.
>>
>> Latencytop shows frequent latency in the real-time threads.  The main
>> offenders are:
>>
>> schedule sleep_on_page wait_on_page_bit ext4_page_mkwrite do_wp_page
>> handle_pte_fault handle_mm_fault do_page_fault page_fault
>>
>> schedule do_get_write_access jbd2_journal_get_write_access
>> __ext4_journal_get_write_access ext4_reserve_inode_write
>> ext4_mark_inode_dirty ext4_dirty_inode __mark_inode_dirty
>> file_update_time do_wp_page handle_pte_fault handle_mm_fault
>>
>>
>> I imagine the problem is that the system is periodically writing out
>> my dirty pages and marking them clean (and hence write protected).
>> When I try to write to them, the kernel makes them writable again,
>> which causes latency either due to updating the inode mtime or because
>> the file is being written to disk when I try to write to it.
>>
>> Is there any way to prevent this?  One possibility would be a way to
>> ask the kernel not to write the file out to disk.  Another would be a
>> way to ask the kernel to make a copy of the file when it writes it
>> disk and leave the original mapping writable.
>>
>> Obviously I can fix this by mapping anonymous memory, but then I need
>> another thread to periodically write my logs out to disk, and if that
>> crashes, I lose data.
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>> Office: (310) 553-5322
>> Mobile: (650) 906-0647
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Andy Lutomirski
AMA Capital Management, LLC
Office: (310) 553-5322
Mobile: (650) 906-0647

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-20  1:15     ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  1:15 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-mm, linux-ext4

On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.

3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
code that I'll submit sometime soon.)  Would this explain the latency
in file_update_time or is that a separate issue?  file_update_time
seems like a good thing to make fully asynchronous (especially if the
file in question is a fifo, but I've already moved my fifos to tmpfs).

>
> We discussed doing copy-on-write, but there are relatively few mmap users and it wasn't clear whether the complexity was worth it.

Hmm.  That might be nice, especially if the page is mlocked.

--Andy

>
> Cheers, Andreas
>
> On 2011-10-19, at 6:39 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>> I have a real-time program that has everything mlocked (i.e.
>> mlockall(MCL_CURRENT | MCL_FUTURE)).  It has some log files opened for
>> writing.  Those files are opened and memset to zero in another thread
>> to fault everything in.  The system is under light I/O load with very
>> little memory pressure.
>>
>> Latencytop shows frequent latency in the real-time threads.  The main
>> offenders are:
>>
>> schedule sleep_on_page wait_on_page_bit ext4_page_mkwrite do_wp_page
>> handle_pte_fault handle_mm_fault do_page_fault page_fault
>>
>> schedule do_get_write_access jbd2_journal_get_write_access
>> __ext4_journal_get_write_access ext4_reserve_inode_write
>> ext4_mark_inode_dirty ext4_dirty_inode __mark_inode_dirty
>> file_update_time do_wp_page handle_pte_fault handle_mm_fault
>>
>>
>> I imagine the problem is that the system is periodically writing out
>> my dirty pages and marking them clean (and hence write protected).
>> When I try to write to them, the kernel makes them writable again,
>> which causes latency either due to updating the inode mtime or because
>> the file is being written to disk when I try to write to it.
>>
>> Is there any way to prevent this?  One possibility would be a way to
>> ask the kernel not to write the file out to disk.  Another would be a
>> way to ask the kernel to make a copy of the file when it writes it
>> disk and leave the original mapping writable.
>>
>> Obviously I can fix this by mapping anonymous memory, but then I need
>> another thread to periodically write my logs out to disk, and if that
>> crashes, I lose data.
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>> Office: (310) 553-5322
>> Mobile: (650) 906-0647
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Andy Lutomirski
AMA Capital Management, LLC
Office: (310) 553-5322
Mobile: (650) 906-0647

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-20  1:15     ` Andy Lutomirski
  (?)
@ 2011-10-20  2:17       ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  2:17 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-mm, linux-ext4

On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
>
> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
> code that I'll submit sometime soon.)  Would this explain the latency
> in file_update_time or is that a separate issue?  file_update_time
> seems like a good thing to make fully asynchronous (especially if the
> file in question is a fifo, but I've already moved my fifos to tmpfs).

On 2.6.39.4, I got one instance of:

call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
__block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
handle_pte_fault handle_mm_fault do_page_fault page_fault

but I'm not seeing the large numbers of the ext4_page_mkwrite trace
that I get on 3.0.6.  file_update_time is now by far the dominant
cause of latency.

I'll leave it running overnight and see what happens.


--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-20  2:17       ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  2:17 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-mm, linux-ext4

On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
>
> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
> code that I'll submit sometime soon.)  Would this explain the latency
> in file_update_time or is that a separate issue?  file_update_time
> seems like a good thing to make fully asynchronous (especially if the
> file in question is a fifo, but I've already moved my fifos to tmpfs).

On 2.6.39.4, I got one instance of:

call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
__block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
handle_pte_fault handle_mm_fault do_page_fault page_fault

but I'm not seeing the large numbers of the ext4_page_mkwrite trace
that I get on 3.0.6.  file_update_time is now by far the dominant
cause of latency.

I'll leave it running overnight and see what happens.


--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-20  2:17       ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  2:17 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-mm, linux-ext4

On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
>
> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
> code that I'll submit sometime soon.)  Would this explain the latency
> in file_update_time or is that a separate issue?  file_update_time
> seems like a good thing to make fully asynchronous (especially if the
> file in question is a fifo, but I've already moved my fifos to tmpfs).

On 2.6.39.4, I got one instance of:

call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
__block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
handle_pte_fault handle_mm_fault do_page_fault page_fault

but I'm not seeing the large numbers of the ext4_page_mkwrite trace
that I get on 3.0.6.  file_update_time is now by far the dominant
cause of latency.

I'll leave it running overnight and see what happens.


--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-20  2:17       ` Andy Lutomirski
  (?)
@ 2011-10-20  5:59         ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  5:59 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-mm, linux-ext4

On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
>>
>> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
>> code that I'll submit sometime soon.)  Would this explain the latency
>> in file_update_time or is that a separate issue?  file_update_time
>> seems like a good thing to make fully asynchronous (especially if the
>> file in question is a fifo, but I've already moved my fifos to tmpfs).
>
> On 2.6.39.4, I got one instance of:
>
> call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
> __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
> handle_pte_fault handle_mm_fault do_page_fault page_fault
>
> but I'm not seeing the large numbers of the ext4_page_mkwrite trace
> that I get on 3.0.6.  file_update_time is now by far the dominant
> cause of latency.

The culprit seems to be do_wp_page -> file_update_time ->
mark_inode_dirty_sync.  This surprises me for two reasons:

 - Why the _sync?  Are we worried that data will be written out before
the metadata?  If so, surely there's a better way than adding latency
here.

 - Why are we calling file_update_time at all?  Presumably we also
update the time when the page is written back (if not, that sounds
like a bug, since the contents may be changed after something saw the
mtime update), and, if so, why bother updating it on the first write?
Anything that relies on this behavior is, I think, unreliable, because
the page could be made writable arbitrarily early by another program
that changes nothing.

--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-20  5:59         ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  5:59 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-mm, linux-ext4

On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
>>
>> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
>> code that I'll submit sometime soon.)  Would this explain the latency
>> in file_update_time or is that a separate issue?  file_update_time
>> seems like a good thing to make fully asynchronous (especially if the
>> file in question is a fifo, but I've already moved my fifos to tmpfs).
>
> On 2.6.39.4, I got one instance of:
>
> call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
> __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
> handle_pte_fault handle_mm_fault do_page_fault page_fault
>
> but I'm not seeing the large numbers of the ext4_page_mkwrite trace
> that I get on 3.0.6.  file_update_time is now by far the dominant
> cause of latency.

The culprit seems to be do_wp_page -> file_update_time ->
mark_inode_dirty_sync.  This surprises me for two reasons:

 - Why the _sync?  Are we worried that data will be written out before
the metadata?  If so, surely there's a better way than adding latency
here.

 - Why are we calling file_update_time at all?  Presumably we also
update the time when the page is written back (if not, that sounds
like a bug, since the contents may be changed after something saw the
mtime update), and, if so, why bother updating it on the first write?
Anything that relies on this behavior is, I think, unreliable, because
the page could be made writable arbitrarily early by another program
that changes nothing.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-20  5:59         ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-20  5:59 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-mm, linux-ext4

On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
>>
>> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
>> code that I'll submit sometime soon.)  Would this explain the latency
>> in file_update_time or is that a separate issue?  file_update_time
>> seems like a good thing to make fully asynchronous (especially if the
>> file in question is a fifo, but I've already moved my fifos to tmpfs).
>
> On 2.6.39.4, I got one instance of:
>
> call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
> __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
> handle_pte_fault handle_mm_fault do_page_fault page_fault
>
> but I'm not seeing the large numbers of the ext4_page_mkwrite trace
> that I get on 3.0.6.  file_update_time is now by far the dominant
> cause of latency.

The culprit seems to be do_wp_page -> file_update_time ->
mark_inode_dirty_sync.  This surprises me for two reasons:

 - Why the _sync?  Are we worried that data will be written out before
the metadata?  If so, surely there's a better way than adding latency
here.

 - Why are we calling file_update_time at all?  Presumably we also
update the time when the page is written back (if not, that sounds
like a bug, since the contents may be changed after something saw the
mtime update), and, if so, why bother updating it on the first write?
Anything that relies on this behavior is, I think, unreliable, because
the page could be made writable arbitrarily early by another program
that changes nothing.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-20  5:59         ` Andy Lutomirski
  (?)
@ 2011-10-25 12:26           ` Jan Kara
  -1 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-10-25 12:26 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Wed 19-10-11 22:59:55, Andy Lutomirski wrote:
> On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> >>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
> >>
> >> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
> >> code that I'll submit sometime soon.)  Would this explain the latency
> >> in file_update_time or is that a separate issue?  file_update_time
> >> seems like a good thing to make fully asynchronous (especially if the
> >> file in question is a fifo, but I've already moved my fifos to tmpfs).
> >
> > On 2.6.39.4, I got one instance of:
> >
> > call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
> > __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
> > handle_pte_fault handle_mm_fault do_page_fault page_fault
> >
> > but I'm not seeing the large numbers of the ext4_page_mkwrite trace
> > that I get on 3.0.6.  file_update_time is now by far the dominant
> > cause of latency.
> 
> The culprit seems to be do_wp_page -> file_update_time ->
> mark_inode_dirty_sync.  This surprises me for two reasons:
> 
>  - Why the _sync?  Are we worried that data will be written out before
> the metadata?  If so, surely there's a better way than adding latency
> here.
  _sync just means that inode will become dirty for fsync(2) purposes but
not for fdatasync(2) purposes - i.e. it's just a timestamp update (or
it could be something similar).

>  - Why are we calling file_update_time at all?  Presumably we also
> update the time when the page is written back (if not, that sounds
> like a bug, since the contents may be changed after something saw the
> mtime update), and, if so, why bother updating it on the first write?
> Anything that relies on this behavior is, I think, unreliable, because
> the page could be made writable arbitrarily early by another program
> that changes nothing.
  We don't update timestamp when the page is written back. I believe this
is mostly because we don't know whether the data has been changed by a
write syscall, which already updated the timestamp, or by mmap. That is
also the reason why we update the timestamp at page fault time.

  The reason why file_update_time() blocks for you is probably that it
needs to get access to buffer where inode is stored on disk and because a
transaction including this buffer is committing at the moment, your thread
has to wait until the transaction commit finishes. This is mostly a problem
specific to how ext4 works so e.g. xfs shouldn't have it.

  Generally I believe the attempts to achieve any RT-like latencies when
writing to a filesystem are rather hopeless. How much hopeless depends on
the load of the filesystem (e.g., in your case of mostly idle filesystem I
can imagine some tweaks could reduce your latencies to an acceptable level
but once the disk gets loaded you'll be screwed). So I'd suggest that
having RT thread just store log in memory (or write to a pipe) and have
another non-RT thread write the data to disk would be a much more robust
design.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-25 12:26           ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-10-25 12:26 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Wed 19-10-11 22:59:55, Andy Lutomirski wrote:
> On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> >>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
> >>
> >> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
> >> code that I'll submit sometime soon.)  Would this explain the latency
> >> in file_update_time or is that a separate issue?  file_update_time
> >> seems like a good thing to make fully asynchronous (especially if the
> >> file in question is a fifo, but I've already moved my fifos to tmpfs).
> >
> > On 2.6.39.4, I got one instance of:
> >
> > call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
> > __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
> > handle_pte_fault handle_mm_fault do_page_fault page_fault
> >
> > but I'm not seeing the large numbers of the ext4_page_mkwrite trace
> > that I get on 3.0.6.  file_update_time is now by far the dominant
> > cause of latency.
> 
> The culprit seems to be do_wp_page -> file_update_time ->
> mark_inode_dirty_sync.  This surprises me for two reasons:
> 
>  - Why the _sync?  Are we worried that data will be written out before
> the metadata?  If so, surely there's a better way than adding latency
> here.
  _sync just means that inode will become dirty for fsync(2) purposes but
not for fdatasync(2) purposes - i.e. it's just a timestamp update (or
it could be something similar).

>  - Why are we calling file_update_time at all?  Presumably we also
> update the time when the page is written back (if not, that sounds
> like a bug, since the contents may be changed after something saw the
> mtime update), and, if so, why bother updating it on the first write?
> Anything that relies on this behavior is, I think, unreliable, because
> the page could be made writable arbitrarily early by another program
> that changes nothing.
  We don't update timestamp when the page is written back. I believe this
is mostly because we don't know whether the data has been changed by a
write syscall, which already updated the timestamp, or by mmap. That is
also the reason why we update the timestamp at page fault time.

  The reason why file_update_time() blocks for you is probably that it
needs to get access to buffer where inode is stored on disk and because a
transaction including this buffer is committing at the moment, your thread
has to wait until the transaction commit finishes. This is mostly a problem
specific to how ext4 works so e.g. xfs shouldn't have it.

  Generally I believe the attempts to achieve any RT-like latencies when
writing to a filesystem are rather hopeless. How much hopeless depends on
the load of the filesystem (e.g., in your case of mostly idle filesystem I
can imagine some tweaks could reduce your latencies to an acceptable level
but once the disk gets loaded you'll be screwed). So I'd suggest that
having RT thread just store log in memory (or write to a pipe) and have
another non-RT thread write the data to disk would be a much more robust
design.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-25 12:26           ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-10-25 12:26 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Wed 19-10-11 22:59:55, Andy Lutomirski wrote:
> On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> >>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
> >>
> >> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
> >> code that I'll submit sometime soon.)  Would this explain the latency
> >> in file_update_time or is that a separate issue?  file_update_time
> >> seems like a good thing to make fully asynchronous (especially if the
> >> file in question is a fifo, but I've already moved my fifos to tmpfs).
> >
> > On 2.6.39.4, I got one instance of:
> >
> > call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
> > __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
> > handle_pte_fault handle_mm_fault do_page_fault page_fault
> >
> > but I'm not seeing the large numbers of the ext4_page_mkwrite trace
> > that I get on 3.0.6.  file_update_time is now by far the dominant
> > cause of latency.
> 
> The culprit seems to be do_wp_page -> file_update_time ->
> mark_inode_dirty_sync.  This surprises me for two reasons:
> 
>  - Why the _sync?  Are we worried that data will be written out before
> the metadata?  If so, surely there's a better way than adding latency
> here.
  _sync just means that inode will become dirty for fsync(2) purposes but
not for fdatasync(2) purposes - i.e. it's just a timestamp update (or
it could be something similar).

>  - Why are we calling file_update_time at all?  Presumably we also
> update the time when the page is written back (if not, that sounds
> like a bug, since the contents may be changed after something saw the
> mtime update), and, if so, why bother updating it on the first write?
> Anything that relies on this behavior is, I think, unreliable, because
> the page could be made writable arbitrarily early by another program
> that changes nothing.
  We don't update timestamp when the page is written back. I believe this
is mostly because we don't know whether the data has been changed by a
write syscall, which already updated the timestamp, or by mmap. That is
also the reason why we update the timestamp at page fault time.

  The reason why file_update_time() blocks for you is probably that it
needs to get access to buffer where inode is stored on disk and because a
transaction including this buffer is committing at the moment, your thread
has to wait until the transaction commit finishes. This is mostly a problem
specific to how ext4 works so e.g. xfs shouldn't have it.

  Generally I believe the attempts to achieve any RT-like latencies when
writing to a filesystem are rather hopeless. How much hopeless depends on
the load of the filesystem (e.g., in your case of mostly idle filesystem I
can imagine some tweaks could reduce your latencies to an acceptable level
but once the disk gets loaded you'll be screwed). So I'd suggest that
having RT thread just store log in memory (or write to a pipe) and have
another non-RT thread write the data to disk would be a much more robust
design.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-25 12:26           ` Jan Kara
@ 2011-10-28 23:37             ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-28 23:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 19-10-11 22:59:55, Andy Lutomirski wrote:
>> On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> >>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
>> >>
>> >> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
>> >> code that I'll submit sometime soon.)  Would this explain the latency
>> >> in file_update_time or is that a separate issue?  file_update_time
>> >> seems like a good thing to make fully asynchronous (especially if the
>> >> file in question is a fifo, but I've already moved my fifos to tmpfs).
>> >
>> > On 2.6.39.4, I got one instance of:
>> >
>> > call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
>> > __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
>> > handle_pte_fault handle_mm_fault do_page_fault page_fault
>> >
>> > but I'm not seeing the large numbers of the ext4_page_mkwrite trace
>> > that I get on 3.0.6.  file_update_time is now by far the dominant
>> > cause of latency.
>>
>> The culprit seems to be do_wp_page -> file_update_time ->
>> mark_inode_dirty_sync.  This surprises me for two reasons:
>>
>>  - Why the _sync?  Are we worried that data will be written out before
>> the metadata?  If so, surely there's a better way than adding latency
>> here.
>  _sync just means that inode will become dirty for fsync(2) purposes but
> not for fdatasync(2) purposes - i.e. it's just a timestamp update (or
> it could be something similar).
>
>>  - Why are we calling file_update_time at all?  Presumably we also
>> update the time when the page is written back (if not, that sounds
>> like a bug, since the contents may be changed after something saw the
>> mtime update), and, if so, why bother updating it on the first write?
>> Anything that relies on this behavior is, I think, unreliable, because
>> the page could be made writable arbitrarily early by another program
>> that changes nothing.
>  We don't update timestamp when the page is written back. I believe this
> is mostly because we don't know whether the data has been changed by a
> write syscall, which already updated the timestamp, or by mmap. That is
> also the reason why we update the timestamp at page fault time.
>
>  The reason why file_update_time() blocks for you is probably that it
> needs to get access to buffer where inode is stored on disk and because a
> transaction including this buffer is committing at the moment, your thread
> has to wait until the transaction commit finishes. This is mostly a problem
> specific to how ext4 works so e.g. xfs shouldn't have it.
>
>  Generally I believe the attempts to achieve any RT-like latencies when
> writing to a filesystem are rather hopeless. How much hopeless depends on
> the load of the filesystem (e.g., in your case of mostly idle filesystem I
> can imagine some tweaks could reduce your latencies to an acceptable level
> but once the disk gets loaded you'll be screwed). So I'd suggest that
> having RT thread just store log in memory (or write to a pipe) and have
> another non-RT thread write the data to disk would be a much more robust
> design.

Windows seems to do pretty well at this, and I think it should be fixable on
Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
page_mkclean_one.  The fallout from that might be unpleasant, though, but
it would probably speed up a number of workloads.

Adding a whole separate process just to copy data from memory to disk sounds
a bit like a hack -- that's what mmap + mlock would do if it worked better.
Incidentally, pipes are no good.  I haven't root-caused it yet, but both reading
to and writing from pipes, even if O_NONBLOCK, can block.  I haven't root-caused
it yet.

Anyway, I'll start sending patches to whittle away at the problem,
starting right now :)

--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-28 23:37             ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-28 23:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 19-10-11 22:59:55, Andy Lutomirski wrote:
>> On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> >>> What kernel are you using?  A change to keep pages consistent during writeout was landed not too long ago (maybe Linux 3.0) in order to allow checksumming of the data.
>> >>
>> >> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
>> >> code that I'll submit sometime soon.)  Would this explain the latency
>> >> in file_update_time or is that a separate issue?  file_update_time
>> >> seems like a good thing to make fully asynchronous (especially if the
>> >> file in question is a fifo, but I've already moved my fifos to tmpfs).
>> >
>> > On 2.6.39.4, I got one instance of:
>> >
>> > call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
>> > __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
>> > handle_pte_fault handle_mm_fault do_page_fault page_fault
>> >
>> > but I'm not seeing the large numbers of the ext4_page_mkwrite trace
>> > that I get on 3.0.6.  file_update_time is now by far the dominant
>> > cause of latency.
>>
>> The culprit seems to be do_wp_page -> file_update_time ->
>> mark_inode_dirty_sync.  This surprises me for two reasons:
>>
>>  - Why the _sync?  Are we worried that data will be written out before
>> the metadata?  If so, surely there's a better way than adding latency
>> here.
>  _sync just means that inode will become dirty for fsync(2) purposes but
> not for fdatasync(2) purposes - i.e. it's just a timestamp update (or
> it could be something similar).
>
>>  - Why are we calling file_update_time at all?  Presumably we also
>> update the time when the page is written back (if not, that sounds
>> like a bug, since the contents may be changed after something saw the
>> mtime update), and, if so, why bother updating it on the first write?
>> Anything that relies on this behavior is, I think, unreliable, because
>> the page could be made writable arbitrarily early by another program
>> that changes nothing.
>  We don't update timestamp when the page is written back. I believe this
> is mostly because we don't know whether the data has been changed by a
> write syscall, which already updated the timestamp, or by mmap. That is
> also the reason why we update the timestamp at page fault time.
>
>  The reason why file_update_time() blocks for you is probably that it
> needs to get access to buffer where inode is stored on disk and because a
> transaction including this buffer is committing at the moment, your thread
> has to wait until the transaction commit finishes. This is mostly a problem
> specific to how ext4 works so e.g. xfs shouldn't have it.
>
>  Generally I believe the attempts to achieve any RT-like latencies when
> writing to a filesystem are rather hopeless. How much hopeless depends on
> the load of the filesystem (e.g., in your case of mostly idle filesystem I
> can imagine some tweaks could reduce your latencies to an acceptable level
> but once the disk gets loaded you'll be screwed). So I'd suggest that
> having RT thread just store log in memory (or write to a pipe) and have
> another non-RT thread write the data to disk would be a much more robust
> design.

Windows seems to do pretty well at this, and I think it should be fixable on
Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
page_mkclean_one.  The fallout from that might be unpleasant, though, but
it would probably speed up a number of workloads.

Adding a whole separate process just to copy data from memory to disk sounds
a bit like a hack -- that's what mmap + mlock would do if it worked better.
Incidentally, pipes are no good.  I haven't root-caused it yet, but both reading
to and writing from pipes, even if O_NONBLOCK, can block.  I haven't root-caused
it yet.

Anyway, I'll start sending patches to whittle away at the problem,
starting right now :)

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH] mm: Improve cmtime update on shared writable mmaps
  2011-10-28 23:37             ` Andy Lutomirski
@ 2011-10-28 23:39               ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-28 23:39 UTC (permalink / raw)
  To: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4
  Cc: Andy Lutomirski

We used to update a file's time on do_wp_page.  This caused latency
whenever file_update_time would sleep (this happens on ext4).  It is
also, IMO, less than ideal: any copy, backup, or 'make' run taken
after do_wp_page but before an mmap user finished writing would see
the new timestamp but the old contents.

With this patch, cmtime is updated after a page is written.  When the
mm code transfers the dirty bit from a pte to the associated struct
page, it also sets a new page flag indicating that the page was
modified directly from userspace.  The inode's time is then updated in
clear_page_dirty_for_io.

We can't update cmtime in all contexts in which ptes are unmapped:
various reclaim paths can unmap ptes from GFP_NOFS paths.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---

I'm not thrilled about using a page flag for this, but I haven't
spotted a better way.  Updating the time in writepage would touch
a lot of files and would interact oddly with write.

 fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
 include/linux/fs.h         |    1 +
 include/linux/page-flags.h |    5 ++++
 mm/page-writeback.c        |   19 +++++++++++++---
 mm/rmap.c                  |   18 +++++++++++++-
 5 files changed, 72 insertions(+), 22 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ec79246..ee93a25 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
 }
 EXPORT_SYMBOL(touch_atime);
 
-/**
- *	file_update_time	-	update mtime and ctime time
- *	@file: file accessed
- *
- *	Update the mtime and ctime members of an inode and mark the inode
- *	for writeback.  Note that this function is meant exclusively for
- *	usage in the file write path of filesystems, and filesystems may
- *	choose to explicitly ignore update via this function with the
- *	S_NOCMTIME inode flag, e.g. for network filesystem where these
- *	timestamps are handled by the server.
- */
-
-void file_update_time(struct file *file)
+static void do_inode_update_time(struct file *file, struct inode *inode)
 {
-	struct inode *inode = file->f_path.dentry->d_inode;
 	struct timespec now;
 	enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
 
@@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
 		return;
 
 	/* Finally allowed to write? Takes lock. */
-	if (mnt_want_write_file(file))
+	if (file && mnt_want_write_file(file))
 		return;
 
 	/* Only change inode inside the lock region */
@@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
 	if (sync_it & S_MTIME)
 		inode->i_mtime = now;
 	mark_inode_dirty_sync(inode);
-	mnt_drop_write(file->f_path.mnt);
+
+	if (file)
+		mnt_drop_write(file->f_path.mnt);
+}
+
+/**
+ *	file_update_time	-	update mtime and ctime time
+ *	@file: file accessed
+ *
+ *	Update the mtime and ctime members of an inode and mark the inode
+ *	for writeback.  Note that this function is meant exclusively for
+ *	usage in the file write path of filesystems, and filesystems may
+ *	choose to explicitly ignore update via this function with the
+ *	S_NOCMTIME inode flag, e.g. for network filesystem where these
+ *	timestamps are handled by the server.
+ */
+
+void file_update_time(struct file *file)
+{
+	do_inode_update_time(file, file->f_path.dentry->d_inode);
 }
 EXPORT_SYMBOL(file_update_time);
 
+/**
+ *	inode_update_time_writable	-	update mtime and ctime
+ *	@inode: inode accessed
+ *
+ *	Same as file_update_time, except that the caller is responsible
+ *	for checking that the mount is writable.
+ */
+
+void inode_update_time_writable(struct inode *inode)
+{
+	do_inode_update_time(0, inode);
+}
+
 int inode_needs_sync(struct inode *inode)
 {
 	if (IS_SYNC(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 277f497..9e28927 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
 extern void setattr_copy(struct inode *inode, const struct iattr *attr);
 
 extern void file_update_time(struct file *file);
+extern void inode_update_time_writable(struct inode *inode);
 
 extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
 extern void save_mount_options(struct super_block *sb, char *options);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e90a673..4eed012 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,6 +107,7 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+	PG_update_cmtime,	/* Dirtied via writable mapping. */
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+/* Whoever clears PG_update_cmtime must update the cmtime. */
+SETPAGEFLAG(UpdateCMTime, update_cmtime)
+TESTCLEARFLAG(UpdateCMTime, update_cmtime)
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0e309cd..41c48ea 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
+ * Returns true if the page was previously dirty.  Also updates inode time
+ * if necessary.
  *
  * This is for preparing to put the page under writeout.  We leave the page
  * tagged as dirty in the radix tree so that a concurrent write-for-sync
@@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  */
 int clear_page_dirty_for_io(struct page *page)
 {
+	int ret;
 	struct address_space *mapping = page_mapping(page);
 
 	BUG_ON(!PageLocked(page));
@@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
+			goto out;
 		}
-		return 0;
+		ret = 0;
+		goto out;
 	}
-	return TestClearPageDirty(page);
+	ret = TestClearPageDirty(page);
+
+out:
+	/* We know that the inode (if any) is on a writable mount. */
+	if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
+		inode_update_time_writable(mapping->host);
+
+	return ret;
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..2ee595d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
+
+			/*
+			 * If dirtied via shared writable mapping, cmtime
+			 * needs to be updated.  If dirtied only through
+			 * write(), etc, then the writer already updated
+			 * cmtime.
+			 */
+			if (ret)
+				SetPageUpdateCMTime(page);
+
 			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
 				ret = 1;
 		}
@@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pteval = ptep_clear_flush_notify(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	if (pte_dirty(pteval)) {
+		SetPageUpdateCMTime(page);
 		set_page_dirty(page);
+	}
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 			set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
-		if (pte_dirty(pteval))
+		if (pte_dirty(pteval)) {
+			SetPageUpdateCMTime(page);
 			set_page_dirty(page);
+		}
 
 		page_remove_rmap(page);
 		page_cache_release(page);
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH] mm: Improve cmtime update on shared writable mmaps
@ 2011-10-28 23:39               ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-28 23:39 UTC (permalink / raw)
  To: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4
  Cc: Andy Lutomirski

We used to update a file's time on do_wp_page.  This caused latency
whenever file_update_time would sleep (this happens on ext4).  It is
also, IMO, less than ideal: any copy, backup, or 'make' run taken
after do_wp_page but before an mmap user finished writing would see
the new timestamp but the old contents.

With this patch, cmtime is updated after a page is written.  When the
mm code transfers the dirty bit from a pte to the associated struct
page, it also sets a new page flag indicating that the page was
modified directly from userspace.  The inode's time is then updated in
clear_page_dirty_for_io.

We can't update cmtime in all contexts in which ptes are unmapped:
various reclaim paths can unmap ptes from GFP_NOFS paths.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---

I'm not thrilled about using a page flag for this, but I haven't
spotted a better way.  Updating the time in writepage would touch
a lot of files and would interact oddly with write.

 fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
 include/linux/fs.h         |    1 +
 include/linux/page-flags.h |    5 ++++
 mm/page-writeback.c        |   19 +++++++++++++---
 mm/rmap.c                  |   18 +++++++++++++-
 5 files changed, 72 insertions(+), 22 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ec79246..ee93a25 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
 }
 EXPORT_SYMBOL(touch_atime);
 
-/**
- *	file_update_time	-	update mtime and ctime time
- *	@file: file accessed
- *
- *	Update the mtime and ctime members of an inode and mark the inode
- *	for writeback.  Note that this function is meant exclusively for
- *	usage in the file write path of filesystems, and filesystems may
- *	choose to explicitly ignore update via this function with the
- *	S_NOCMTIME inode flag, e.g. for network filesystem where these
- *	timestamps are handled by the server.
- */
-
-void file_update_time(struct file *file)
+static void do_inode_update_time(struct file *file, struct inode *inode)
 {
-	struct inode *inode = file->f_path.dentry->d_inode;
 	struct timespec now;
 	enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
 
@@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
 		return;
 
 	/* Finally allowed to write? Takes lock. */
-	if (mnt_want_write_file(file))
+	if (file && mnt_want_write_file(file))
 		return;
 
 	/* Only change inode inside the lock region */
@@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
 	if (sync_it & S_MTIME)
 		inode->i_mtime = now;
 	mark_inode_dirty_sync(inode);
-	mnt_drop_write(file->f_path.mnt);
+
+	if (file)
+		mnt_drop_write(file->f_path.mnt);
+}
+
+/**
+ *	file_update_time	-	update mtime and ctime time
+ *	@file: file accessed
+ *
+ *	Update the mtime and ctime members of an inode and mark the inode
+ *	for writeback.  Note that this function is meant exclusively for
+ *	usage in the file write path of filesystems, and filesystems may
+ *	choose to explicitly ignore update via this function with the
+ *	S_NOCMTIME inode flag, e.g. for network filesystem where these
+ *	timestamps are handled by the server.
+ */
+
+void file_update_time(struct file *file)
+{
+	do_inode_update_time(file, file->f_path.dentry->d_inode);
 }
 EXPORT_SYMBOL(file_update_time);
 
+/**
+ *	inode_update_time_writable	-	update mtime and ctime
+ *	@inode: inode accessed
+ *
+ *	Same as file_update_time, except that the caller is responsible
+ *	for checking that the mount is writable.
+ */
+
+void inode_update_time_writable(struct inode *inode)
+{
+	do_inode_update_time(0, inode);
+}
+
 int inode_needs_sync(struct inode *inode)
 {
 	if (IS_SYNC(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 277f497..9e28927 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
 extern void setattr_copy(struct inode *inode, const struct iattr *attr);
 
 extern void file_update_time(struct file *file);
+extern void inode_update_time_writable(struct inode *inode);
 
 extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
 extern void save_mount_options(struct super_block *sb, char *options);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e90a673..4eed012 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,6 +107,7 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+	PG_update_cmtime,	/* Dirtied via writable mapping. */
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+/* Whoever clears PG_update_cmtime must update the cmtime. */
+SETPAGEFLAG(UpdateCMTime, update_cmtime)
+TESTCLEARFLAG(UpdateCMTime, update_cmtime)
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0e309cd..41c48ea 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
+ * Returns true if the page was previously dirty.  Also updates inode time
+ * if necessary.
  *
  * This is for preparing to put the page under writeout.  We leave the page
  * tagged as dirty in the radix tree so that a concurrent write-for-sync
@@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  */
 int clear_page_dirty_for_io(struct page *page)
 {
+	int ret;
 	struct address_space *mapping = page_mapping(page);
 
 	BUG_ON(!PageLocked(page));
@@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
+			goto out;
 		}
-		return 0;
+		ret = 0;
+		goto out;
 	}
-	return TestClearPageDirty(page);
+	ret = TestClearPageDirty(page);
+
+out:
+	/* We know that the inode (if any) is on a writable mount. */
+	if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
+		inode_update_time_writable(mapping->host);
+
+	return ret;
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..2ee595d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
+
+			/*
+			 * If dirtied via shared writable mapping, cmtime
+			 * needs to be updated.  If dirtied only through
+			 * write(), etc, then the writer already updated
+			 * cmtime.
+			 */
+			if (ret)
+				SetPageUpdateCMTime(page);
+
 			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
 				ret = 1;
 		}
@@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pteval = ptep_clear_flush_notify(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	if (pte_dirty(pteval)) {
+		SetPageUpdateCMTime(page);
 		set_page_dirty(page);
+	}
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 			set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
-		if (pte_dirty(pteval))
+		if (pte_dirty(pteval)) {
+			SetPageUpdateCMTime(page);
 			set_page_dirty(page);
+		}
 
 		page_remove_rmap(page);
 		page_cache_release(page);
-- 
1.7.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-28 23:37             ` Andy Lutomirski
  (?)
@ 2011-10-31 23:10               ` Jan Kara
  -1 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-10-31 23:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
> >>  - Why are we calling file_update_time at all?  Presumably we also
> >> update the time when the page is written back (if not, that sounds
> >> like a bug, since the contents may be changed after something saw the
> >> mtime update), and, if so, why bother updating it on the first write?
> >> Anything that relies on this behavior is, I think, unreliable, because
> >> the page could be made writable arbitrarily early by another program
> >> that changes nothing.
> >  We don't update timestamp when the page is written back. I believe this
> > is mostly because we don't know whether the data has been changed by a
> > write syscall, which already updated the timestamp, or by mmap. That is
> > also the reason why we update the timestamp at page fault time.
> >
> >  The reason why file_update_time() blocks for you is probably that it
> > needs to get access to buffer where inode is stored on disk and because a
> > transaction including this buffer is committing at the moment, your thread
> > has to wait until the transaction commit finishes. This is mostly a problem
> > specific to how ext4 works so e.g. xfs shouldn't have it.
> >
> >  Generally I believe the attempts to achieve any RT-like latencies when
> > writing to a filesystem are rather hopeless. How much hopeless depends on
> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
> > can imagine some tweaks could reduce your latencies to an acceptable level
> > but once the disk gets loaded you'll be screwed). So I'd suggest that
> > having RT thread just store log in memory (or write to a pipe) and have
> > another non-RT thread write the data to disk would be a much more robust
> > design.
> 
> Windows seems to do pretty well at this, and I think it should be fixable on
> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
> page_mkclean_one.  The fallout from that might be unpleasant, though, but
> it would probably speed up a number of workloads.
  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
unlikely to go away in a forseeable future. The reason is that we need to
reliably account the number of dirty pages so that we can throttle
processes that dirty too much of memory and also protect agaist system
going into out-of-memory problems when too many pages would be dirty (and
thus hard to reclaim). Thus we create clean pages as write-protected, when
they are first written to, we account them as dirtied and unprotect them.
When pages are cleaned by writeback, we decrement number of dirty pages
accordingly and write-protect them again. 
 
> Adding a whole separate process just to copy data from memory to disk sounds
> a bit like a hack -- that's what mmap + mlock would do if it worked better.
  Well, always only guarantees you cannot hit major fault when accessing
the page. And we keep that promise - we only hit a minor fault. But I agree
that for your usecase this is impractical.

I can see as theoretically feasible for writeback to skip mlocked pages
which would help your case. But practically, I do not see how to implement
that efficiently (just skipping a dirty page when we find it's mlocked
seems like a way to waste CPU needlessly).

> Incidentally, pipes are no good.  I haven't root-caused it yet, but both
> reading to and writing from pipes, even if O_NONBLOCK, can block.  I
> haven't root-caused it yet.
  Interesting. I imagine they could block on memory allocation but I guess
you don't put that much pressure on your system. So it might be interesting
to know where else they block...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-31 23:10               ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-10-31 23:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
> >>  - Why are we calling file_update_time at all?  Presumably we also
> >> update the time when the page is written back (if not, that sounds
> >> like a bug, since the contents may be changed after something saw the
> >> mtime update), and, if so, why bother updating it on the first write?
> >> Anything that relies on this behavior is, I think, unreliable, because
> >> the page could be made writable arbitrarily early by another program
> >> that changes nothing.
> >  We don't update timestamp when the page is written back. I believe this
> > is mostly because we don't know whether the data has been changed by a
> > write syscall, which already updated the timestamp, or by mmap. That is
> > also the reason why we update the timestamp at page fault time.
> >
> >  The reason why file_update_time() blocks for you is probably that it
> > needs to get access to buffer where inode is stored on disk and because a
> > transaction including this buffer is committing at the moment, your thread
> > has to wait until the transaction commit finishes. This is mostly a problem
> > specific to how ext4 works so e.g. xfs shouldn't have it.
> >
> >  Generally I believe the attempts to achieve any RT-like latencies when
> > writing to a filesystem are rather hopeless. How much hopeless depends on
> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
> > can imagine some tweaks could reduce your latencies to an acceptable level
> > but once the disk gets loaded you'll be screwed). So I'd suggest that
> > having RT thread just store log in memory (or write to a pipe) and have
> > another non-RT thread write the data to disk would be a much more robust
> > design.
> 
> Windows seems to do pretty well at this, and I think it should be fixable on
> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
> page_mkclean_one.  The fallout from that might be unpleasant, though, but
> it would probably speed up a number of workloads.
  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
unlikely to go away in a forseeable future. The reason is that we need to
reliably account the number of dirty pages so that we can throttle
processes that dirty too much of memory and also protect agaist system
going into out-of-memory problems when too many pages would be dirty (and
thus hard to reclaim). Thus we create clean pages as write-protected, when
they are first written to, we account them as dirtied and unprotect them.
When pages are cleaned by writeback, we decrement number of dirty pages
accordingly and write-protect them again. 
 
> Adding a whole separate process just to copy data from memory to disk sounds
> a bit like a hack -- that's what mmap + mlock would do if it worked better.
  Well, always only guarantees you cannot hit major fault when accessing
the page. And we keep that promise - we only hit a minor fault. But I agree
that for your usecase this is impractical.

I can see as theoretically feasible for writeback to skip mlocked pages
which would help your case. But practically, I do not see how to implement
that efficiently (just skipping a dirty page when we find it's mlocked
seems like a way to waste CPU needlessly).

> Incidentally, pipes are no good.  I haven't root-caused it yet, but both
> reading to and writing from pipes, even if O_NONBLOCK, can block.  I
> haven't root-caused it yet.
  Interesting. I imagine they could block on memory allocation but I guess
you don't put that much pressure on your system. So it might be interesting
to know where else they block...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-31 23:10               ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-10-31 23:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
> >>  - Why are we calling file_update_time at all?  Presumably we also
> >> update the time when the page is written back (if not, that sounds
> >> like a bug, since the contents may be changed after something saw the
> >> mtime update), and, if so, why bother updating it on the first write?
> >> Anything that relies on this behavior is, I think, unreliable, because
> >> the page could be made writable arbitrarily early by another program
> >> that changes nothing.
> >  We don't update timestamp when the page is written back. I believe this
> > is mostly because we don't know whether the data has been changed by a
> > write syscall, which already updated the timestamp, or by mmap. That is
> > also the reason why we update the timestamp at page fault time.
> >
> >  The reason why file_update_time() blocks for you is probably that it
> > needs to get access to buffer where inode is stored on disk and because a
> > transaction including this buffer is committing at the moment, your thread
> > has to wait until the transaction commit finishes. This is mostly a problem
> > specific to how ext4 works so e.g. xfs shouldn't have it.
> >
> >  Generally I believe the attempts to achieve any RT-like latencies when
> > writing to a filesystem are rather hopeless. How much hopeless depends on
> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
> > can imagine some tweaks could reduce your latencies to an acceptable level
> > but once the disk gets loaded you'll be screwed). So I'd suggest that
> > having RT thread just store log in memory (or write to a pipe) and have
> > another non-RT thread write the data to disk would be a much more robust
> > design.
> 
> Windows seems to do pretty well at this, and I think it should be fixable on
> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
> page_mkclean_one.  The fallout from that might be unpleasant, though, but
> it would probably speed up a number of workloads.
  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
unlikely to go away in a forseeable future. The reason is that we need to
reliably account the number of dirty pages so that we can throttle
processes that dirty too much of memory and also protect agaist system
going into out-of-memory problems when too many pages would be dirty (and
thus hard to reclaim). Thus we create clean pages as write-protected, when
they are first written to, we account them as dirtied and unprotect them.
When pages are cleaned by writeback, we decrement number of dirty pages
accordingly and write-protect them again. 
 
> Adding a whole separate process just to copy data from memory to disk sounds
> a bit like a hack -- that's what mmap + mlock would do if it worked better.
  Well, always only guarantees you cannot hit major fault when accessing
the page. And we keep that promise - we only hit a minor fault. But I agree
that for your usecase this is impractical.

I can see as theoretically feasible for writeback to skip mlocked pages
which would help your case. But practically, I do not see how to implement
that efficiently (just skipping a dirty page when we find it's mlocked
seems like a way to waste CPU needlessly).

> Incidentally, pipes are no good.  I haven't root-caused it yet, but both
> reading to and writing from pipes, even if O_NONBLOCK, can block.  I
> haven't root-caused it yet.
  Interesting. I imagine they could block on memory allocation but I guess
you don't put that much pressure on your system. So it might be interesting
to know where else they block...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-31 23:10               ` Jan Kara
  (?)
@ 2011-10-31 23:14                 ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-31 23:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
>> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
>> >>  - Why are we calling file_update_time at all?  Presumably we also
>> >> update the time when the page is written back (if not, that sounds
>> >> like a bug, since the contents may be changed after something saw the
>> >> mtime update), and, if so, why bother updating it on the first write?
>> >> Anything that relies on this behavior is, I think, unreliable, because
>> >> the page could be made writable arbitrarily early by another program
>> >> that changes nothing.
>> >  We don't update timestamp when the page is written back. I believe this
>> > is mostly because we don't know whether the data has been changed by a
>> > write syscall, which already updated the timestamp, or by mmap. That is
>> > also the reason why we update the timestamp at page fault time.
>> >
>> >  The reason why file_update_time() blocks for you is probably that it
>> > needs to get access to buffer where inode is stored on disk and because a
>> > transaction including this buffer is committing at the moment, your thread
>> > has to wait until the transaction commit finishes. This is mostly a problem
>> > specific to how ext4 works so e.g. xfs shouldn't have it.
>> >
>> >  Generally I believe the attempts to achieve any RT-like latencies when
>> > writing to a filesystem are rather hopeless. How much hopeless depends on
>> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
>> > can imagine some tweaks could reduce your latencies to an acceptable level
>> > but once the disk gets loaded you'll be screwed). So I'd suggest that
>> > having RT thread just store log in memory (or write to a pipe) and have
>> > another non-RT thread write the data to disk would be a much more robust
>> > design.
>>
>> Windows seems to do pretty well at this, and I think it should be fixable on
>> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
>> page_mkclean_one.  The fallout from that might be unpleasant, though, but
>> it would probably speed up a number of workloads.
>  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
> unlikely to go away in a forseeable future. The reason is that we need to
> reliably account the number of dirty pages so that we can throttle
> processes that dirty too much of memory and also protect agaist system
> going into out-of-memory problems when too many pages would be dirty (and
> thus hard to reclaim). Thus we create clean pages as write-protected, when
> they are first written to, we account them as dirtied and unprotect them.
> When pages are cleaned by writeback, we decrement number of dirty pages
> accordingly and write-protect them again.

What about skipping pte_wrprotect for mlocked pages and continuing to
account them dirty even if they're actually clean?  This should be a
straightforward patch except for the effect on stable pages for
writeback.  (It would also have unfortunate side effects on
ctime/mtime without my other patch to rearrange that code.)

>
>> Adding a whole separate process just to copy data from memory to disk sounds
>> a bit like a hack -- that's what mmap + mlock would do if it worked better.
>  Well, always only guarantees you cannot hit major fault when accessing
> the page. And we keep that promise - we only hit a minor fault. But I agree
> that for your usecase this is impractical.

Not really true.  We never fault in the page, but make_write can wait
for I/O (for hundreds of ms) which is just as bad.

>
> I can see as theoretically feasible for writeback to skip mlocked pages
> which would help your case. But practically, I do not see how to implement
> that efficiently (just skipping a dirty page when we find it's mlocked
> seems like a way to waste CPU needlessly).
>
>> Incidentally, pipes are no good.  I haven't root-caused it yet, but both
>> reading to and writing from pipes, even if O_NONBLOCK, can block.  I
>> haven't root-caused it yet.
>  Interesting. I imagine they could block on memory allocation but I guess
> you don't put that much pressure on your system. So it might be interesting
> to know where else they block...

I'll figure it out in a couple of days, I imagine.

--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-31 23:14                 ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-31 23:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
>> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
>> >>  - Why are we calling file_update_time at all?  Presumably we also
>> >> update the time when the page is written back (if not, that sounds
>> >> like a bug, since the contents may be changed after something saw the
>> >> mtime update), and, if so, why bother updating it on the first write?
>> >> Anything that relies on this behavior is, I think, unreliable, because
>> >> the page could be made writable arbitrarily early by another program
>> >> that changes nothing.
>> >  We don't update timestamp when the page is written back. I believe this
>> > is mostly because we don't know whether the data has been changed by a
>> > write syscall, which already updated the timestamp, or by mmap. That is
>> > also the reason why we update the timestamp at page fault time.
>> >
>> >  The reason why file_update_time() blocks for you is probably that it
>> > needs to get access to buffer where inode is stored on disk and because a
>> > transaction including this buffer is committing at the moment, your thread
>> > has to wait until the transaction commit finishes. This is mostly a problem
>> > specific to how ext4 works so e.g. xfs shouldn't have it.
>> >
>> >  Generally I believe the attempts to achieve any RT-like latencies when
>> > writing to a filesystem are rather hopeless. How much hopeless depends on
>> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
>> > can imagine some tweaks could reduce your latencies to an acceptable level
>> > but once the disk gets loaded you'll be screwed). So I'd suggest that
>> > having RT thread just store log in memory (or write to a pipe) and have
>> > another non-RT thread write the data to disk would be a much more robust
>> > design.
>>
>> Windows seems to do pretty well at this, and I think it should be fixable on
>> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
>> page_mkclean_one.  The fallout from that might be unpleasant, though, but
>> it would probably speed up a number of workloads.
>  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
> unlikely to go away in a forseeable future. The reason is that we need to
> reliably account the number of dirty pages so that we can throttle
> processes that dirty too much of memory and also protect agaist system
> going into out-of-memory problems when too many pages would be dirty (and
> thus hard to reclaim). Thus we create clean pages as write-protected, when
> they are first written to, we account them as dirtied and unprotect them.
> When pages are cleaned by writeback, we decrement number of dirty pages
> accordingly and write-protect them again.

What about skipping pte_wrprotect for mlocked pages and continuing to
account them dirty even if they're actually clean?  This should be a
straightforward patch except for the effect on stable pages for
writeback.  (It would also have unfortunate side effects on
ctime/mtime without my other patch to rearrange that code.)

>
>> Adding a whole separate process just to copy data from memory to disk sounds
>> a bit like a hack -- that's what mmap + mlock would do if it worked better.
>  Well, always only guarantees you cannot hit major fault when accessing
> the page. And we keep that promise - we only hit a minor fault. But I agree
> that for your usecase this is impractical.

Not really true.  We never fault in the page, but make_write can wait
for I/O (for hundreds of ms) which is just as bad.

>
> I can see as theoretically feasible for writeback to skip mlocked pages
> which would help your case. But practically, I do not see how to implement
> that efficiently (just skipping a dirty page when we find it's mlocked
> seems like a way to waste CPU needlessly).
>
>> Incidentally, pipes are no good.  I haven't root-caused it yet, but both
>> reading to and writing from pipes, even if O_NONBLOCK, can block.  I
>> haven't root-caused it yet.
>  Interesting. I imagine they could block on memory allocation but I guess
> you don't put that much pressure on your system. So it might be interesting
> to know where else they block...

I'll figure it out in a couple of days, I imagine.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-10-31 23:14                 ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-10-31 23:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
>> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
>> >>  - Why are we calling file_update_time at all?  Presumably we also
>> >> update the time when the page is written back (if not, that sounds
>> >> like a bug, since the contents may be changed after something saw the
>> >> mtime update), and, if so, why bother updating it on the first write?
>> >> Anything that relies on this behavior is, I think, unreliable, because
>> >> the page could be made writable arbitrarily early by another program
>> >> that changes nothing.
>> >  We don't update timestamp when the page is written back. I believe this
>> > is mostly because we don't know whether the data has been changed by a
>> > write syscall, which already updated the timestamp, or by mmap. That is
>> > also the reason why we update the timestamp at page fault time.
>> >
>> >  The reason why file_update_time() blocks for you is probably that it
>> > needs to get access to buffer where inode is stored on disk and because a
>> > transaction including this buffer is committing at the moment, your thread
>> > has to wait until the transaction commit finishes. This is mostly a problem
>> > specific to how ext4 works so e.g. xfs shouldn't have it.
>> >
>> >  Generally I believe the attempts to achieve any RT-like latencies when
>> > writing to a filesystem are rather hopeless. How much hopeless depends on
>> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
>> > can imagine some tweaks could reduce your latencies to an acceptable level
>> > but once the disk gets loaded you'll be screwed). So I'd suggest that
>> > having RT thread just store log in memory (or write to a pipe) and have
>> > another non-RT thread write the data to disk would be a much more robust
>> > design.
>>
>> Windows seems to do pretty well at this, and I think it should be fixable on
>> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
>> page_mkclean_one.  The fallout from that might be unpleasant, though, but
>> it would probably speed up a number of workloads.
>  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
> unlikely to go away in a forseeable future. The reason is that we need to
> reliably account the number of dirty pages so that we can throttle
> processes that dirty too much of memory and also protect agaist system
> going into out-of-memory problems when too many pages would be dirty (and
> thus hard to reclaim). Thus we create clean pages as write-protected, when
> they are first written to, we account them as dirtied and unprotect them.
> When pages are cleaned by writeback, we decrement number of dirty pages
> accordingly and write-protect them again.

What about skipping pte_wrprotect for mlocked pages and continuing to
account them dirty even if they're actually clean?  This should be a
straightforward patch except for the effect on stable pages for
writeback.  (It would also have unfortunate side effects on
ctime/mtime without my other patch to rearrange that code.)

>
>> Adding a whole separate process just to copy data from memory to disk sounds
>> a bit like a hack -- that's what mmap + mlock would do if it worked better.
>  Well, always only guarantees you cannot hit major fault when accessing
> the page. And we keep that promise - we only hit a minor fault. But I agree
> that for your usecase this is impractical.

Not really true.  We never fault in the page, but make_write can wait
for I/O (for hundreds of ms) which is just as bad.

>
> I can see as theoretically feasible for writeback to skip mlocked pages
> which would help your case. But practically, I do not see how to implement
> that efficiently (just skipping a dirty page when we find it's mlocked
> seems like a way to waste CPU needlessly).
>
>> Incidentally, pipes are no good.  I haven't root-caused it yet, but both
>> reading to and writing from pipes, even if O_NONBLOCK, can block.  I
>> haven't root-caused it yet.
>  Interesting. I imagine they could block on memory allocation but I guess
> you don't put that much pressure on your system. So it might be interesting
> to know where else they block...

I'll figure it out in a couple of days, I imagine.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
  2011-10-28 23:39               ` Andy Lutomirski
@ 2011-11-01 22:53                 ` Jan Kara
  -1 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-01 22:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4

On Fri 28-10-11 16:39:25, Andy Lutomirski wrote:
> We used to update a file's time on do_wp_page.  This caused latency
> whenever file_update_time would sleep (this happens on ext4).  It is
> also, IMO, less than ideal: any copy, backup, or 'make' run taken
> after do_wp_page but before an mmap user finished writing would see
> the new timestamp but the old contents.
> 
> With this patch, cmtime is updated after a page is written.  When the
> mm code transfers the dirty bit from a pte to the associated struct
> page, it also sets a new page flag indicating that the page was
> modified directly from userspace.  The inode's time is then updated in
> clear_page_dirty_for_io.
> 
> We can't update cmtime in all contexts in which ptes are unmapped:
> various reclaim paths can unmap ptes from GFP_NOFS paths.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
> 
> I'm not thrilled about using a page flag for this, but I haven't
> spotted a better way.  Updating the time in writepage would touch
> a lot of files and would interact oddly with write.
  I see two problems with this patch:
1) Using a page flags is really a no-go. We are rather short on page flags
so using them for such minor thing is a real wastage. Moreover it should be
rather easy to just use an inode flag instead.

2) You cannot call inode_update_time_writable() from
clear_page_dirty_for_io() because that is called under a page lock and thus
would create lock inversion problems.

								Honza
> 
>  fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
>  include/linux/fs.h         |    1 +
>  include/linux/page-flags.h |    5 ++++
>  mm/page-writeback.c        |   19 +++++++++++++---
>  mm/rmap.c                  |   18 +++++++++++++-
>  5 files changed, 72 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index ec79246..ee93a25 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
>  }
>  EXPORT_SYMBOL(touch_atime);
>  
> -/**
> - *	file_update_time	-	update mtime and ctime time
> - *	@file: file accessed
> - *
> - *	Update the mtime and ctime members of an inode and mark the inode
> - *	for writeback.  Note that this function is meant exclusively for
> - *	usage in the file write path of filesystems, and filesystems may
> - *	choose to explicitly ignore update via this function with the
> - *	S_NOCMTIME inode flag, e.g. for network filesystem where these
> - *	timestamps are handled by the server.
> - */
> -
> -void file_update_time(struct file *file)
> +static void do_inode_update_time(struct file *file, struct inode *inode)
>  {
> -	struct inode *inode = file->f_path.dentry->d_inode;
>  	struct timespec now;
>  	enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
>  
> @@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
>  		return;
>  
>  	/* Finally allowed to write? Takes lock. */
> -	if (mnt_want_write_file(file))
> +	if (file && mnt_want_write_file(file))
>  		return;
>  
>  	/* Only change inode inside the lock region */
> @@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
>  	if (sync_it & S_MTIME)
>  		inode->i_mtime = now;
>  	mark_inode_dirty_sync(inode);
> -	mnt_drop_write(file->f_path.mnt);
> +
> +	if (file)
> +		mnt_drop_write(file->f_path.mnt);
> +}
> +
> +/**
> + *	file_update_time	-	update mtime and ctime time
> + *	@file: file accessed
> + *
> + *	Update the mtime and ctime members of an inode and mark the inode
> + *	for writeback.  Note that this function is meant exclusively for
> + *	usage in the file write path of filesystems, and filesystems may
> + *	choose to explicitly ignore update via this function with the
> + *	S_NOCMTIME inode flag, e.g. for network filesystem where these
> + *	timestamps are handled by the server.
> + */
> +
> +void file_update_time(struct file *file)
> +{
> +	do_inode_update_time(file, file->f_path.dentry->d_inode);
>  }
>  EXPORT_SYMBOL(file_update_time);
>  
> +/**
> + *	inode_update_time_writable	-	update mtime and ctime
> + *	@inode: inode accessed
> + *
> + *	Same as file_update_time, except that the caller is responsible
> + *	for checking that the mount is writable.
> + */
> +
> +void inode_update_time_writable(struct inode *inode)
> +{
> +	do_inode_update_time(0, inode);
> +}
> +
>  int inode_needs_sync(struct inode *inode)
>  {
>  	if (IS_SYNC(inode))
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 277f497..9e28927 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
>  extern void setattr_copy(struct inode *inode, const struct iattr *attr);
>  
>  extern void file_update_time(struct file *file);
> +extern void inode_update_time_writable(struct inode *inode);
>  
>  extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
>  extern void save_mount_options(struct super_block *sb, char *options);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e90a673..4eed012 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -107,6 +107,7 @@ enum pageflags {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	PG_compound_lock,
>  #endif
> +	PG_update_cmtime,	/* Dirtied via writable mapping. */
>  	__NR_PAGEFLAGS,
>  
>  	/* Filesystems */
> @@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
>  #define __PG_HWPOISON 0
>  #endif
>  
> +/* Whoever clears PG_update_cmtime must update the cmtime. */
> +SETPAGEFLAG(UpdateCMTime, update_cmtime)
> +TESTCLEARFLAG(UpdateCMTime, update_cmtime)
> +
>  u64 stable_page_flags(struct page *page);
>  
>  static inline int PageUptodate(struct page *page)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0e309cd..41c48ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  
>  /*
>   * Clear a page's dirty flag, while caring for dirty memory accounting.
> - * Returns true if the page was previously dirty.
> + * Returns true if the page was previously dirty.  Also updates inode time
> + * if necessary.
>   *
>   * This is for preparing to put the page under writeout.  We leave the page
>   * tagged as dirty in the radix tree so that a concurrent write-for-sync
> @@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>   */
>  int clear_page_dirty_for_io(struct page *page)
>  {
> +	int ret;
>  	struct address_space *mapping = page_mapping(page);
>  
>  	BUG_ON(!PageLocked(page));
> @@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> -			return 1;
> +			ret = 1;
> +			goto out;
>  		}
> -		return 0;
> +		ret = 0;
> +		goto out;
>  	}
> -	return TestClearPageDirty(page);
> +	ret = TestClearPageDirty(page);
> +
> +out:
> +	/* We know that the inode (if any) is on a writable mount. */
> +	if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
> +		inode_update_time_writable(mapping->host);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL(clear_page_dirty_for_io);
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..2ee595d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
>  		struct address_space *mapping = page_mapping(page);
>  		if (mapping) {
>  			ret = page_mkclean_file(mapping, page);
> +
> +			/*
> +			 * If dirtied via shared writable mapping, cmtime
> +			 * needs to be updated.  If dirtied only through
> +			 * write(), etc, then the writer already updated
> +			 * cmtime.
> +			 */
> +			if (ret)
> +				SetPageUpdateCMTime(page);
> +
>  			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
>  				ret = 1;
>  		}
> @@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pteval = ptep_clear_flush_notify(vma, address, pte);
>  
>  	/* Move the dirty bit to the physical page now the pte is gone. */
> -	if (pte_dirty(pteval))
> +	if (pte_dirty(pteval)) {
> +		SetPageUpdateCMTime(page);
>  		set_page_dirty(page);
> +	}
>  
>  	/* Update high watermark before we lower rss */
>  	update_hiwater_rss(mm);
> @@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  			set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
>  
>  		/* Move the dirty bit to the physical page now the pte is gone. */
> -		if (pte_dirty(pteval))
> +		if (pte_dirty(pteval)) {
> +			SetPageUpdateCMTime(page);
>  			set_page_dirty(page);
> +		}
>  
>  		page_remove_rmap(page);
>  		page_cache_release(page);
> -- 
> 1.7.6.4
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
@ 2011-11-01 22:53                 ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-01 22:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4

On Fri 28-10-11 16:39:25, Andy Lutomirski wrote:
> We used to update a file's time on do_wp_page.  This caused latency
> whenever file_update_time would sleep (this happens on ext4).  It is
> also, IMO, less than ideal: any copy, backup, or 'make' run taken
> after do_wp_page but before an mmap user finished writing would see
> the new timestamp but the old contents.
> 
> With this patch, cmtime is updated after a page is written.  When the
> mm code transfers the dirty bit from a pte to the associated struct
> page, it also sets a new page flag indicating that the page was
> modified directly from userspace.  The inode's time is then updated in
> clear_page_dirty_for_io.
> 
> We can't update cmtime in all contexts in which ptes are unmapped:
> various reclaim paths can unmap ptes from GFP_NOFS paths.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
> 
> I'm not thrilled about using a page flag for this, but I haven't
> spotted a better way.  Updating the time in writepage would touch
> a lot of files and would interact oddly with write.
  I see two problems with this patch:
1) Using a page flags is really a no-go. We are rather short on page flags
so using them for such minor thing is a real wastage. Moreover it should be
rather easy to just use an inode flag instead.

2) You cannot call inode_update_time_writable() from
clear_page_dirty_for_io() because that is called under a page lock and thus
would create lock inversion problems.

								Honza
> 
>  fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
>  include/linux/fs.h         |    1 +
>  include/linux/page-flags.h |    5 ++++
>  mm/page-writeback.c        |   19 +++++++++++++---
>  mm/rmap.c                  |   18 +++++++++++++-
>  5 files changed, 72 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index ec79246..ee93a25 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
>  }
>  EXPORT_SYMBOL(touch_atime);
>  
> -/**
> - *	file_update_time	-	update mtime and ctime time
> - *	@file: file accessed
> - *
> - *	Update the mtime and ctime members of an inode and mark the inode
> - *	for writeback.  Note that this function is meant exclusively for
> - *	usage in the file write path of filesystems, and filesystems may
> - *	choose to explicitly ignore update via this function with the
> - *	S_NOCMTIME inode flag, e.g. for network filesystem where these
> - *	timestamps are handled by the server.
> - */
> -
> -void file_update_time(struct file *file)
> +static void do_inode_update_time(struct file *file, struct inode *inode)
>  {
> -	struct inode *inode = file->f_path.dentry->d_inode;
>  	struct timespec now;
>  	enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
>  
> @@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
>  		return;
>  
>  	/* Finally allowed to write? Takes lock. */
> -	if (mnt_want_write_file(file))
> +	if (file && mnt_want_write_file(file))
>  		return;
>  
>  	/* Only change inode inside the lock region */
> @@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
>  	if (sync_it & S_MTIME)
>  		inode->i_mtime = now;
>  	mark_inode_dirty_sync(inode);
> -	mnt_drop_write(file->f_path.mnt);
> +
> +	if (file)
> +		mnt_drop_write(file->f_path.mnt);
> +}
> +
> +/**
> + *	file_update_time	-	update mtime and ctime time
> + *	@file: file accessed
> + *
> + *	Update the mtime and ctime members of an inode and mark the inode
> + *	for writeback.  Note that this function is meant exclusively for
> + *	usage in the file write path of filesystems, and filesystems may
> + *	choose to explicitly ignore update via this function with the
> + *	S_NOCMTIME inode flag, e.g. for network filesystem where these
> + *	timestamps are handled by the server.
> + */
> +
> +void file_update_time(struct file *file)
> +{
> +	do_inode_update_time(file, file->f_path.dentry->d_inode);
>  }
>  EXPORT_SYMBOL(file_update_time);
>  
> +/**
> + *	inode_update_time_writable	-	update mtime and ctime
> + *	@inode: inode accessed
> + *
> + *	Same as file_update_time, except that the caller is responsible
> + *	for checking that the mount is writable.
> + */
> +
> +void inode_update_time_writable(struct inode *inode)
> +{
> +	do_inode_update_time(0, inode);
> +}
> +
>  int inode_needs_sync(struct inode *inode)
>  {
>  	if (IS_SYNC(inode))
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 277f497..9e28927 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
>  extern void setattr_copy(struct inode *inode, const struct iattr *attr);
>  
>  extern void file_update_time(struct file *file);
> +extern void inode_update_time_writable(struct inode *inode);
>  
>  extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
>  extern void save_mount_options(struct super_block *sb, char *options);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e90a673..4eed012 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -107,6 +107,7 @@ enum pageflags {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	PG_compound_lock,
>  #endif
> +	PG_update_cmtime,	/* Dirtied via writable mapping. */
>  	__NR_PAGEFLAGS,
>  
>  	/* Filesystems */
> @@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
>  #define __PG_HWPOISON 0
>  #endif
>  
> +/* Whoever clears PG_update_cmtime must update the cmtime. */
> +SETPAGEFLAG(UpdateCMTime, update_cmtime)
> +TESTCLEARFLAG(UpdateCMTime, update_cmtime)
> +
>  u64 stable_page_flags(struct page *page);
>  
>  static inline int PageUptodate(struct page *page)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0e309cd..41c48ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  
>  /*
>   * Clear a page's dirty flag, while caring for dirty memory accounting.
> - * Returns true if the page was previously dirty.
> + * Returns true if the page was previously dirty.  Also updates inode time
> + * if necessary.
>   *
>   * This is for preparing to put the page under writeout.  We leave the page
>   * tagged as dirty in the radix tree so that a concurrent write-for-sync
> @@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>   */
>  int clear_page_dirty_for_io(struct page *page)
>  {
> +	int ret;
>  	struct address_space *mapping = page_mapping(page);
>  
>  	BUG_ON(!PageLocked(page));
> @@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> -			return 1;
> +			ret = 1;
> +			goto out;
>  		}
> -		return 0;
> +		ret = 0;
> +		goto out;
>  	}
> -	return TestClearPageDirty(page);
> +	ret = TestClearPageDirty(page);
> +
> +out:
> +	/* We know that the inode (if any) is on a writable mount. */
> +	if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
> +		inode_update_time_writable(mapping->host);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL(clear_page_dirty_for_io);
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..2ee595d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
>  		struct address_space *mapping = page_mapping(page);
>  		if (mapping) {
>  			ret = page_mkclean_file(mapping, page);
> +
> +			/*
> +			 * If dirtied via shared writable mapping, cmtime
> +			 * needs to be updated.  If dirtied only through
> +			 * write(), etc, then the writer already updated
> +			 * cmtime.
> +			 */
> +			if (ret)
> +				SetPageUpdateCMTime(page);
> +
>  			if (page_test_and_clear_dirty(page_to_pfn(page), 1))
>  				ret = 1;
>  		}
> @@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pteval = ptep_clear_flush_notify(vma, address, pte);
>  
>  	/* Move the dirty bit to the physical page now the pte is gone. */
> -	if (pte_dirty(pteval))
> +	if (pte_dirty(pteval)) {
> +		SetPageUpdateCMTime(page);
>  		set_page_dirty(page);
> +	}
>  
>  	/* Update high watermark before we lower rss */
>  	update_hiwater_rss(mm);
> @@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  			set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
>  
>  		/* Move the dirty bit to the physical page now the pte is gone. */
> -		if (pte_dirty(pteval))
> +		if (pte_dirty(pteval)) {
> +			SetPageUpdateCMTime(page);
>  			set_page_dirty(page);
> +		}
>  
>  		page_remove_rmap(page);
>  		page_cache_release(page);
> -- 
> 1.7.6.4
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
  2011-11-01 22:53                 ` Jan Kara
  (?)
@ 2011-11-01 23:02                   ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-01 23:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel, linux-ext4

On Tue, Nov 1, 2011 at 3:53 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 28-10-11 16:39:25, Andy Lutomirski wrote:
>> We used to update a file's time on do_wp_page.  This caused latency
>> whenever file_update_time would sleep (this happens on ext4).  It is
>> also, IMO, less than ideal: any copy, backup, or 'make' run taken
>> after do_wp_page but before an mmap user finished writing would see
>> the new timestamp but the old contents.
>>
>> With this patch, cmtime is updated after a page is written.  When the
>> mm code transfers the dirty bit from a pte to the associated struct
>> page, it also sets a new page flag indicating that the page was
>> modified directly from userspace.  The inode's time is then updated in
>> clear_page_dirty_for_io.
>>
>> We can't update cmtime in all contexts in which ptes are unmapped:
>> various reclaim paths can unmap ptes from GFP_NOFS paths.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>
>> I'm not thrilled about using a page flag for this, but I haven't
>> spotted a better way.  Updating the time in writepage would touch
>> a lot of files and would interact oddly with write.
>  I see two problems with this patch:
> 1) Using a page flags is really a no-go. We are rather short on page flags
> so using them for such minor thing is a real wastage. Moreover it should be
> rather easy to just use an inode flag instead.

Am I allowed to set inode flags without holding any locks?

>
> 2) You cannot call inode_update_time_writable() from
> clear_page_dirty_for_io() because that is called under a page lock and thus
> would create lock inversion problems.
>

Hmm.  Isn't it permitted to at least read from an fs while holding the
page lock?  I thought that the page lock was held for the entire
duration of a read and at the beginning of writeback.

I can push this down to the ->writepage implementations or to the
clear_page_dirty_for_io callers, but that will result in a bigger
patch.

--Andy

>                                                                Honza
>>
>>  fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
>>  include/linux/fs.h         |    1 +
>>  include/linux/page-flags.h |    5 ++++
>>  mm/page-writeback.c        |   19 +++++++++++++---
>>  mm/rmap.c                  |   18 +++++++++++++-
>>  5 files changed, 72 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index ec79246..ee93a25 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
>>  }
>>  EXPORT_SYMBOL(touch_atime);
>>
>> -/**
>> - *   file_update_time        -       update mtime and ctime time
>> - *   @file: file accessed
>> - *
>> - *   Update the mtime and ctime members of an inode and mark the inode
>> - *   for writeback.  Note that this function is meant exclusively for
>> - *   usage in the file write path of filesystems, and filesystems may
>> - *   choose to explicitly ignore update via this function with the
>> - *   S_NOCMTIME inode flag, e.g. for network filesystem where these
>> - *   timestamps are handled by the server.
>> - */
>> -
>> -void file_update_time(struct file *file)
>> +static void do_inode_update_time(struct file *file, struct inode *inode)
>>  {
>> -     struct inode *inode = file->f_path.dentry->d_inode;
>>       struct timespec now;
>>       enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
>>
>> @@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
>>               return;
>>
>>       /* Finally allowed to write? Takes lock. */
>> -     if (mnt_want_write_file(file))
>> +     if (file && mnt_want_write_file(file))
>>               return;
>>
>>       /* Only change inode inside the lock region */
>> @@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
>>       if (sync_it & S_MTIME)
>>               inode->i_mtime = now;
>>       mark_inode_dirty_sync(inode);
>> -     mnt_drop_write(file->f_path.mnt);
>> +
>> +     if (file)
>> +             mnt_drop_write(file->f_path.mnt);
>> +}
>> +
>> +/**
>> + *   file_update_time        -       update mtime and ctime time
>> + *   @file: file accessed
>> + *
>> + *   Update the mtime and ctime members of an inode and mark the inode
>> + *   for writeback.  Note that this function is meant exclusively for
>> + *   usage in the file write path of filesystems, and filesystems may
>> + *   choose to explicitly ignore update via this function with the
>> + *   S_NOCMTIME inode flag, e.g. for network filesystem where these
>> + *   timestamps are handled by the server.
>> + */
>> +
>> +void file_update_time(struct file *file)
>> +{
>> +     do_inode_update_time(file, file->f_path.dentry->d_inode);
>>  }
>>  EXPORT_SYMBOL(file_update_time);
>>
>> +/**
>> + *   inode_update_time_writable      -       update mtime and ctime
>> + *   @inode: inode accessed
>> + *
>> + *   Same as file_update_time, except that the caller is responsible
>> + *   for checking that the mount is writable.
>> + */
>> +
>> +void inode_update_time_writable(struct inode *inode)
>> +{
>> +     do_inode_update_time(0, inode);
>> +}
>> +
>>  int inode_needs_sync(struct inode *inode)
>>  {
>>       if (IS_SYNC(inode))
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 277f497..9e28927 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
>>  extern void setattr_copy(struct inode *inode, const struct iattr *attr);
>>
>>  extern void file_update_time(struct file *file);
>> +extern void inode_update_time_writable(struct inode *inode);
>>
>>  extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
>>  extern void save_mount_options(struct super_block *sb, char *options);
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index e90a673..4eed012 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -107,6 +107,7 @@ enum pageflags {
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>       PG_compound_lock,
>>  #endif
>> +     PG_update_cmtime,       /* Dirtied via writable mapping. */
>>       __NR_PAGEFLAGS,
>>
>>       /* Filesystems */
>> @@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
>>  #define __PG_HWPOISON 0
>>  #endif
>>
>> +/* Whoever clears PG_update_cmtime must update the cmtime. */
>> +SETPAGEFLAG(UpdateCMTime, update_cmtime)
>> +TESTCLEARFLAG(UpdateCMTime, update_cmtime)
>> +
>>  u64 stable_page_flags(struct page *page);
>>
>>  static inline int PageUptodate(struct page *page)
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 0e309cd..41c48ea 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>
>>  /*
>>   * Clear a page's dirty flag, while caring for dirty memory accounting.
>> - * Returns true if the page was previously dirty.
>> + * Returns true if the page was previously dirty.  Also updates inode time
>> + * if necessary.
>>   *
>>   * This is for preparing to put the page under writeout.  We leave the page
>>   * tagged as dirty in the radix tree so that a concurrent write-for-sync
>> @@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>   */
>>  int clear_page_dirty_for_io(struct page *page)
>>  {
>> +     int ret;
>>       struct address_space *mapping = page_mapping(page);
>>
>>       BUG_ON(!PageLocked(page));
>> @@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
>>                       dec_zone_page_state(page, NR_FILE_DIRTY);
>>                       dec_bdi_stat(mapping->backing_dev_info,
>>                                       BDI_RECLAIMABLE);
>> -                     return 1;
>> +                     ret = 1;
>> +                     goto out;
>>               }
>> -             return 0;
>> +             ret = 0;
>> +             goto out;
>>       }
>> -     return TestClearPageDirty(page);
>> +     ret = TestClearPageDirty(page);
>> +
>> +out:
>> +     /* We know that the inode (if any) is on a writable mount. */
>> +     if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
>> +             inode_update_time_writable(mapping->host);
>> +
>> +     return ret;
>>  }
>>  EXPORT_SYMBOL(clear_page_dirty_for_io);
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 8005080..2ee595d 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
>>               struct address_space *mapping = page_mapping(page);
>>               if (mapping) {
>>                       ret = page_mkclean_file(mapping, page);
>> +
>> +                     /*
>> +                      * If dirtied via shared writable mapping, cmtime
>> +                      * needs to be updated.  If dirtied only through
>> +                      * write(), etc, then the writer already updated
>> +                      * cmtime.
>> +                      */
>> +                     if (ret)
>> +                             SetPageUpdateCMTime(page);
>> +
>>                       if (page_test_and_clear_dirty(page_to_pfn(page), 1))
>>                               ret = 1;
>>               }
>> @@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>       pteval = ptep_clear_flush_notify(vma, address, pte);
>>
>>       /* Move the dirty bit to the physical page now the pte is gone. */
>> -     if (pte_dirty(pteval))
>> +     if (pte_dirty(pteval)) {
>> +             SetPageUpdateCMTime(page);
>>               set_page_dirty(page);
>> +     }
>>
>>       /* Update high watermark before we lower rss */
>>       update_hiwater_rss(mm);
>> @@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>>                       set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
>>
>>               /* Move the dirty bit to the physical page now the pte is gone. */
>> -             if (pte_dirty(pteval))
>> +             if (pte_dirty(pteval)) {
>> +                     SetPageUpdateCMTime(page);
>>                       set_page_dirty(page);
>> +             }
>>
>>               page_remove_rmap(page);
>>               page_cache_release(page);
>> --
>> 1.7.6.4
>>
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>



-- 
Andy Lutomirski
AMA Capital Management, LLC
Office: (310) 553-5322
Mobile: (650) 906-0647

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
@ 2011-11-01 23:02                   ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-01 23:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel, linux-ext4

On Tue, Nov 1, 2011 at 3:53 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 28-10-11 16:39:25, Andy Lutomirski wrote:
>> We used to update a file's time on do_wp_page.  This caused latency
>> whenever file_update_time would sleep (this happens on ext4).  It is
>> also, IMO, less than ideal: any copy, backup, or 'make' run taken
>> after do_wp_page but before an mmap user finished writing would see
>> the new timestamp but the old contents.
>>
>> With this patch, cmtime is updated after a page is written.  When the
>> mm code transfers the dirty bit from a pte to the associated struct
>> page, it also sets a new page flag indicating that the page was
>> modified directly from userspace.  The inode's time is then updated in
>> clear_page_dirty_for_io.
>>
>> We can't update cmtime in all contexts in which ptes are unmapped:
>> various reclaim paths can unmap ptes from GFP_NOFS paths.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>
>> I'm not thrilled about using a page flag for this, but I haven't
>> spotted a better way.  Updating the time in writepage would touch
>> a lot of files and would interact oddly with write.
>  I see two problems with this patch:
> 1) Using a page flags is really a no-go. We are rather short on page flags
> so using them for such minor thing is a real wastage. Moreover it should be
> rather easy to just use an inode flag instead.

Am I allowed to set inode flags without holding any locks?

>
> 2) You cannot call inode_update_time_writable() from
> clear_page_dirty_for_io() because that is called under a page lock and thus
> would create lock inversion problems.
>

Hmm.  Isn't it permitted to at least read from an fs while holding the
page lock?  I thought that the page lock was held for the entire
duration of a read and at the beginning of writeback.

I can push this down to the ->writepage implementations or to the
clear_page_dirty_for_io callers, but that will result in a bigger
patch.

--Andy

>                                                                Honza
>>
>>  fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
>>  include/linux/fs.h         |    1 +
>>  include/linux/page-flags.h |    5 ++++
>>  mm/page-writeback.c        |   19 +++++++++++++---
>>  mm/rmap.c                  |   18 +++++++++++++-
>>  5 files changed, 72 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index ec79246..ee93a25 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
>>  }
>>  EXPORT_SYMBOL(touch_atime);
>>
>> -/**
>> - *   file_update_time        -       update mtime and ctime time
>> - *   @file: file accessed
>> - *
>> - *   Update the mtime and ctime members of an inode and mark the inode
>> - *   for writeback.  Note that this function is meant exclusively for
>> - *   usage in the file write path of filesystems, and filesystems may
>> - *   choose to explicitly ignore update via this function with the
>> - *   S_NOCMTIME inode flag, e.g. for network filesystem where these
>> - *   timestamps are handled by the server.
>> - */
>> -
>> -void file_update_time(struct file *file)
>> +static void do_inode_update_time(struct file *file, struct inode *inode)
>>  {
>> -     struct inode *inode = file->f_path.dentry->d_inode;
>>       struct timespec now;
>>       enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
>>
>> @@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
>>               return;
>>
>>       /* Finally allowed to write? Takes lock. */
>> -     if (mnt_want_write_file(file))
>> +     if (file && mnt_want_write_file(file))
>>               return;
>>
>>       /* Only change inode inside the lock region */
>> @@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
>>       if (sync_it & S_MTIME)
>>               inode->i_mtime = now;
>>       mark_inode_dirty_sync(inode);
>> -     mnt_drop_write(file->f_path.mnt);
>> +
>> +     if (file)
>> +             mnt_drop_write(file->f_path.mnt);
>> +}
>> +
>> +/**
>> + *   file_update_time        -       update mtime and ctime time
>> + *   @file: file accessed
>> + *
>> + *   Update the mtime and ctime members of an inode and mark the inode
>> + *   for writeback.  Note that this function is meant exclusively for
>> + *   usage in the file write path of filesystems, and filesystems may
>> + *   choose to explicitly ignore update via this function with the
>> + *   S_NOCMTIME inode flag, e.g. for network filesystem where these
>> + *   timestamps are handled by the server.
>> + */
>> +
>> +void file_update_time(struct file *file)
>> +{
>> +     do_inode_update_time(file, file->f_path.dentry->d_inode);
>>  }
>>  EXPORT_SYMBOL(file_update_time);
>>
>> +/**
>> + *   inode_update_time_writable      -       update mtime and ctime
>> + *   @inode: inode accessed
>> + *
>> + *   Same as file_update_time, except that the caller is responsible
>> + *   for checking that the mount is writable.
>> + */
>> +
>> +void inode_update_time_writable(struct inode *inode)
>> +{
>> +     do_inode_update_time(0, inode);
>> +}
>> +
>>  int inode_needs_sync(struct inode *inode)
>>  {
>>       if (IS_SYNC(inode))
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 277f497..9e28927 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
>>  extern void setattr_copy(struct inode *inode, const struct iattr *attr);
>>
>>  extern void file_update_time(struct file *file);
>> +extern void inode_update_time_writable(struct inode *inode);
>>
>>  extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
>>  extern void save_mount_options(struct super_block *sb, char *options);
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index e90a673..4eed012 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -107,6 +107,7 @@ enum pageflags {
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>       PG_compound_lock,
>>  #endif
>> +     PG_update_cmtime,       /* Dirtied via writable mapping. */
>>       __NR_PAGEFLAGS,
>>
>>       /* Filesystems */
>> @@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
>>  #define __PG_HWPOISON 0
>>  #endif
>>
>> +/* Whoever clears PG_update_cmtime must update the cmtime. */
>> +SETPAGEFLAG(UpdateCMTime, update_cmtime)
>> +TESTCLEARFLAG(UpdateCMTime, update_cmtime)
>> +
>>  u64 stable_page_flags(struct page *page);
>>
>>  static inline int PageUptodate(struct page *page)
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 0e309cd..41c48ea 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>
>>  /*
>>   * Clear a page's dirty flag, while caring for dirty memory accounting.
>> - * Returns true if the page was previously dirty.
>> + * Returns true if the page was previously dirty.  Also updates inode time
>> + * if necessary.
>>   *
>>   * This is for preparing to put the page under writeout.  We leave the page
>>   * tagged as dirty in the radix tree so that a concurrent write-for-sync
>> @@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>   */
>>  int clear_page_dirty_for_io(struct page *page)
>>  {
>> +     int ret;
>>       struct address_space *mapping = page_mapping(page);
>>
>>       BUG_ON(!PageLocked(page));
>> @@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
>>                       dec_zone_page_state(page, NR_FILE_DIRTY);
>>                       dec_bdi_stat(mapping->backing_dev_info,
>>                                       BDI_RECLAIMABLE);
>> -                     return 1;
>> +                     ret = 1;
>> +                     goto out;
>>               }
>> -             return 0;
>> +             ret = 0;
>> +             goto out;
>>       }
>> -     return TestClearPageDirty(page);
>> +     ret = TestClearPageDirty(page);
>> +
>> +out:
>> +     /* We know that the inode (if any) is on a writable mount. */
>> +     if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
>> +             inode_update_time_writable(mapping->host);
>> +
>> +     return ret;
>>  }
>>  EXPORT_SYMBOL(clear_page_dirty_for_io);
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 8005080..2ee595d 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
>>               struct address_space *mapping = page_mapping(page);
>>               if (mapping) {
>>                       ret = page_mkclean_file(mapping, page);
>> +
>> +                     /*
>> +                      * If dirtied via shared writable mapping, cmtime
>> +                      * needs to be updated.  If dirtied only through
>> +                      * write(), etc, then the writer already updated
>> +                      * cmtime.
>> +                      */
>> +                     if (ret)
>> +                             SetPageUpdateCMTime(page);
>> +
>>                       if (page_test_and_clear_dirty(page_to_pfn(page), 1))
>>                               ret = 1;
>>               }
>> @@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>       pteval = ptep_clear_flush_notify(vma, address, pte);
>>
>>       /* Move the dirty bit to the physical page now the pte is gone. */
>> -     if (pte_dirty(pteval))
>> +     if (pte_dirty(pteval)) {
>> +             SetPageUpdateCMTime(page);
>>               set_page_dirty(page);
>> +     }
>>
>>       /* Update high watermark before we lower rss */
>>       update_hiwater_rss(mm);
>> @@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>>                       set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
>>
>>               /* Move the dirty bit to the physical page now the pte is gone. */
>> -             if (pte_dirty(pteval))
>> +             if (pte_dirty(pteval)) {
>> +                     SetPageUpdateCMTime(page);
>>                       set_page_dirty(page);
>> +             }
>>
>>               page_remove_rmap(page);
>>               page_cache_release(page);
>> --
>> 1.7.6.4
>>
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>



-- 
Andy Lutomirski
AMA Capital Management, LLC
Office: (310) 553-5322
Mobile: (650) 906-0647
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
@ 2011-11-01 23:02                   ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-01 23:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel, linux-ext4

On Tue, Nov 1, 2011 at 3:53 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 28-10-11 16:39:25, Andy Lutomirski wrote:
>> We used to update a file's time on do_wp_page.  This caused latency
>> whenever file_update_time would sleep (this happens on ext4).  It is
>> also, IMO, less than ideal: any copy, backup, or 'make' run taken
>> after do_wp_page but before an mmap user finished writing would see
>> the new timestamp but the old contents.
>>
>> With this patch, cmtime is updated after a page is written.  When the
>> mm code transfers the dirty bit from a pte to the associated struct
>> page, it also sets a new page flag indicating that the page was
>> modified directly from userspace.  The inode's time is then updated in
>> clear_page_dirty_for_io.
>>
>> We can't update cmtime in all contexts in which ptes are unmapped:
>> various reclaim paths can unmap ptes from GFP_NOFS paths.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>
>> I'm not thrilled about using a page flag for this, but I haven't
>> spotted a better way.  Updating the time in writepage would touch
>> a lot of files and would interact oddly with write.
>  I see two problems with this patch:
> 1) Using a page flags is really a no-go. We are rather short on page flags
> so using them for such minor thing is a real wastage. Moreover it should be
> rather easy to just use an inode flag instead.

Am I allowed to set inode flags without holding any locks?

>
> 2) You cannot call inode_update_time_writable() from
> clear_page_dirty_for_io() because that is called under a page lock and thus
> would create lock inversion problems.
>

Hmm.  Isn't it permitted to at least read from an fs while holding the
page lock?  I thought that the page lock was held for the entire
duration of a read and at the beginning of writeback.

I can push this down to the ->writepage implementations or to the
clear_page_dirty_for_io callers, but that will result in a bigger
patch.

--Andy

>                                                                Honza
>>
>>  fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
>>  include/linux/fs.h         |    1 +
>>  include/linux/page-flags.h |    5 ++++
>>  mm/page-writeback.c        |   19 +++++++++++++---
>>  mm/rmap.c                  |   18 +++++++++++++-
>>  5 files changed, 72 insertions(+), 22 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index ec79246..ee93a25 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
>>  }
>>  EXPORT_SYMBOL(touch_atime);
>>
>> -/**
>> - *   file_update_time        -       update mtime and ctime time
>> - *   @file: file accessed
>> - *
>> - *   Update the mtime and ctime members of an inode and mark the inode
>> - *   for writeback.  Note that this function is meant exclusively for
>> - *   usage in the file write path of filesystems, and filesystems may
>> - *   choose to explicitly ignore update via this function with the
>> - *   S_NOCMTIME inode flag, e.g. for network filesystem where these
>> - *   timestamps are handled by the server.
>> - */
>> -
>> -void file_update_time(struct file *file)
>> +static void do_inode_update_time(struct file *file, struct inode *inode)
>>  {
>> -     struct inode *inode = file->f_path.dentry->d_inode;
>>       struct timespec now;
>>       enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
>>
>> @@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
>>               return;
>>
>>       /* Finally allowed to write? Takes lock. */
>> -     if (mnt_want_write_file(file))
>> +     if (file && mnt_want_write_file(file))
>>               return;
>>
>>       /* Only change inode inside the lock region */
>> @@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
>>       if (sync_it & S_MTIME)
>>               inode->i_mtime = now;
>>       mark_inode_dirty_sync(inode);
>> -     mnt_drop_write(file->f_path.mnt);
>> +
>> +     if (file)
>> +             mnt_drop_write(file->f_path.mnt);
>> +}
>> +
>> +/**
>> + *   file_update_time        -       update mtime and ctime time
>> + *   @file: file accessed
>> + *
>> + *   Update the mtime and ctime members of an inode and mark the inode
>> + *   for writeback.  Note that this function is meant exclusively for
>> + *   usage in the file write path of filesystems, and filesystems may
>> + *   choose to explicitly ignore update via this function with the
>> + *   S_NOCMTIME inode flag, e.g. for network filesystem where these
>> + *   timestamps are handled by the server.
>> + */
>> +
>> +void file_update_time(struct file *file)
>> +{
>> +     do_inode_update_time(file, file->f_path.dentry->d_inode);
>>  }
>>  EXPORT_SYMBOL(file_update_time);
>>
>> +/**
>> + *   inode_update_time_writable      -       update mtime and ctime
>> + *   @inode: inode accessed
>> + *
>> + *   Same as file_update_time, except that the caller is responsible
>> + *   for checking that the mount is writable.
>> + */
>> +
>> +void inode_update_time_writable(struct inode *inode)
>> +{
>> +     do_inode_update_time(0, inode);
>> +}
>> +
>>  int inode_needs_sync(struct inode *inode)
>>  {
>>       if (IS_SYNC(inode))
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 277f497..9e28927 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
>>  extern void setattr_copy(struct inode *inode, const struct iattr *attr);
>>
>>  extern void file_update_time(struct file *file);
>> +extern void inode_update_time_writable(struct inode *inode);
>>
>>  extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
>>  extern void save_mount_options(struct super_block *sb, char *options);
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index e90a673..4eed012 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -107,6 +107,7 @@ enum pageflags {
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>       PG_compound_lock,
>>  #endif
>> +     PG_update_cmtime,       /* Dirtied via writable mapping. */
>>       __NR_PAGEFLAGS,
>>
>>       /* Filesystems */
>> @@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
>>  #define __PG_HWPOISON 0
>>  #endif
>>
>> +/* Whoever clears PG_update_cmtime must update the cmtime. */
>> +SETPAGEFLAG(UpdateCMTime, update_cmtime)
>> +TESTCLEARFLAG(UpdateCMTime, update_cmtime)
>> +
>>  u64 stable_page_flags(struct page *page);
>>
>>  static inline int PageUptodate(struct page *page)
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 0e309cd..41c48ea 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>
>>  /*
>>   * Clear a page's dirty flag, while caring for dirty memory accounting.
>> - * Returns true if the page was previously dirty.
>> + * Returns true if the page was previously dirty.  Also updates inode time
>> + * if necessary.
>>   *
>>   * This is for preparing to put the page under writeout.  We leave the page
>>   * tagged as dirty in the radix tree so that a concurrent write-for-sync
>> @@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>   */
>>  int clear_page_dirty_for_io(struct page *page)
>>  {
>> +     int ret;
>>       struct address_space *mapping = page_mapping(page);
>>
>>       BUG_ON(!PageLocked(page));
>> @@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
>>                       dec_zone_page_state(page, NR_FILE_DIRTY);
>>                       dec_bdi_stat(mapping->backing_dev_info,
>>                                       BDI_RECLAIMABLE);
>> -                     return 1;
>> +                     ret = 1;
>> +                     goto out;
>>               }
>> -             return 0;
>> +             ret = 0;
>> +             goto out;
>>       }
>> -     return TestClearPageDirty(page);
>> +     ret = TestClearPageDirty(page);
>> +
>> +out:
>> +     /* We know that the inode (if any) is on a writable mount. */
>> +     if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
>> +             inode_update_time_writable(mapping->host);
>> +
>> +     return ret;
>>  }
>>  EXPORT_SYMBOL(clear_page_dirty_for_io);
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 8005080..2ee595d 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
>>               struct address_space *mapping = page_mapping(page);
>>               if (mapping) {
>>                       ret = page_mkclean_file(mapping, page);
>> +
>> +                     /*
>> +                      * If dirtied via shared writable mapping, cmtime
>> +                      * needs to be updated.  If dirtied only through
>> +                      * write(), etc, then the writer already updated
>> +                      * cmtime.
>> +                      */
>> +                     if (ret)
>> +                             SetPageUpdateCMTime(page);
>> +
>>                       if (page_test_and_clear_dirty(page_to_pfn(page), 1))
>>                               ret = 1;
>>               }
>> @@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>       pteval = ptep_clear_flush_notify(vma, address, pte);
>>
>>       /* Move the dirty bit to the physical page now the pte is gone. */
>> -     if (pte_dirty(pteval))
>> +     if (pte_dirty(pteval)) {
>> +             SetPageUpdateCMTime(page);
>>               set_page_dirty(page);
>> +     }
>>
>>       /* Update high watermark before we lower rss */
>>       update_hiwater_rss(mm);
>> @@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>>                       set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
>>
>>               /* Move the dirty bit to the physical page now the pte is gone. */
>> -             if (pte_dirty(pteval))
>> +             if (pte_dirty(pteval)) {
>> +                     SetPageUpdateCMTime(page);
>>                       set_page_dirty(page);
>> +             }
>>
>>               page_remove_rmap(page);
>>               page_cache_release(page);
>> --
>> 1.7.6.4
>>
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>



-- 
Andy Lutomirski
AMA Capital Management, LLC
Office: (310) 553-5322
Mobile: (650) 906-0647

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-10-31 23:14                 ` Andy Lutomirski
  (?)
@ 2011-11-01 23:03                   ` Jan Kara
  -1 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-01 23:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Mon 31-10-11 16:14:47, Andy Lutomirski wrote:
> On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
> >> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
> >> >>  - Why are we calling file_update_time at all?  Presumably we also
> >> >> update the time when the page is written back (if not, that sounds
> >> >> like a bug, since the contents may be changed after something saw the
> >> >> mtime update), and, if so, why bother updating it on the first write?
> >> >> Anything that relies on this behavior is, I think, unreliable, because
> >> >> the page could be made writable arbitrarily early by another program
> >> >> that changes nothing.
> >> >  We don't update timestamp when the page is written back. I believe this
> >> > is mostly because we don't know whether the data has been changed by a
> >> > write syscall, which already updated the timestamp, or by mmap. That is
> >> > also the reason why we update the timestamp at page fault time.
> >> >
> >> >  The reason why file_update_time() blocks for you is probably that it
> >> > needs to get access to buffer where inode is stored on disk and because a
> >> > transaction including this buffer is committing at the moment, your thread
> >> > has to wait until the transaction commit finishes. This is mostly a problem
> >> > specific to how ext4 works so e.g. xfs shouldn't have it.
> >> >
> >> >  Generally I believe the attempts to achieve any RT-like latencies when
> >> > writing to a filesystem are rather hopeless. How much hopeless depends on
> >> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
> >> > can imagine some tweaks could reduce your latencies to an acceptable level
> >> > but once the disk gets loaded you'll be screwed). So I'd suggest that
> >> > having RT thread just store log in memory (or write to a pipe) and have
> >> > another non-RT thread write the data to disk would be a much more robust
> >> > design.
> >>
> >> Windows seems to do pretty well at this, and I think it should be fixable on
> >> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
> >> page_mkclean_one.  The fallout from that might be unpleasant, though, but
> >> it would probably speed up a number of workloads.
> >  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
> > unlikely to go away in a forseeable future. The reason is that we need to
> > reliably account the number of dirty pages so that we can throttle
> > processes that dirty too much of memory and also protect agaist system
> > going into out-of-memory problems when too many pages would be dirty (and
> > thus hard to reclaim). Thus we create clean pages as write-protected, when
> > they are first written to, we account them as dirtied and unprotect them.
> > When pages are cleaned by writeback, we decrement number of dirty pages
> > accordingly and write-protect them again.
> 
> What about skipping pte_wrprotect for mlocked pages and continuing to
> account them dirty even if they're actually clean?  This should be a
> straightforward patch except for the effect on stable pages for
> writeback.  (It would also have unfortunate side effects on
> ctime/mtime without my other patch to rearrange that code.)
  Well, doing proper dirty accounting would be a mess (you'd have to
unaccount dirty pages during munlock etc.) and I'm not sure what all would
break when page writes would not be coupled with page faults. So I don't
think it's really worth it.

Avoiding IO during a minor fault would be a decent thing which might be
worth pursuing. As you properly noted "stable pages during writeback"
requirement is one obstacle which won't be that trivial to avoid though...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-11-01 23:03                   ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-01 23:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Mon 31-10-11 16:14:47, Andy Lutomirski wrote:
> On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
> >> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
> >> >>  - Why are we calling file_update_time at all?  Presumably we also
> >> >> update the time when the page is written back (if not, that sounds
> >> >> like a bug, since the contents may be changed after something saw the
> >> >> mtime update), and, if so, why bother updating it on the first write?
> >> >> Anything that relies on this behavior is, I think, unreliable, because
> >> >> the page could be made writable arbitrarily early by another program
> >> >> that changes nothing.
> >> >  We don't update timestamp when the page is written back. I believe this
> >> > is mostly because we don't know whether the data has been changed by a
> >> > write syscall, which already updated the timestamp, or by mmap. That is
> >> > also the reason why we update the timestamp at page fault time.
> >> >
> >> >  The reason why file_update_time() blocks for you is probably that it
> >> > needs to get access to buffer where inode is stored on disk and because a
> >> > transaction including this buffer is committing at the moment, your thread
> >> > has to wait until the transaction commit finishes. This is mostly a problem
> >> > specific to how ext4 works so e.g. xfs shouldn't have it.
> >> >
> >> >  Generally I believe the attempts to achieve any RT-like latencies when
> >> > writing to a filesystem are rather hopeless. How much hopeless depends on
> >> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
> >> > can imagine some tweaks could reduce your latencies to an acceptable level
> >> > but once the disk gets loaded you'll be screwed). So I'd suggest that
> >> > having RT thread just store log in memory (or write to a pipe) and have
> >> > another non-RT thread write the data to disk would be a much more robust
> >> > design.
> >>
> >> Windows seems to do pretty well at this, and I think it should be fixable on
> >> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
> >> page_mkclean_one.  The fallout from that might be unpleasant, though, but
> >> it would probably speed up a number of workloads.
> >  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
> > unlikely to go away in a forseeable future. The reason is that we need to
> > reliably account the number of dirty pages so that we can throttle
> > processes that dirty too much of memory and also protect agaist system
> > going into out-of-memory problems when too many pages would be dirty (and
> > thus hard to reclaim). Thus we create clean pages as write-protected, when
> > they are first written to, we account them as dirtied and unprotect them.
> > When pages are cleaned by writeback, we decrement number of dirty pages
> > accordingly and write-protect them again.
> 
> What about skipping pte_wrprotect for mlocked pages and continuing to
> account them dirty even if they're actually clean?  This should be a
> straightforward patch except for the effect on stable pages for
> writeback.  (It would also have unfortunate side effects on
> ctime/mtime without my other patch to rearrange that code.)
  Well, doing proper dirty accounting would be a mess (you'd have to
unaccount dirty pages during munlock etc.) and I'm not sure what all would
break when page writes would not be coupled with page faults. So I don't
think it's really worth it.

Avoiding IO during a minor fault would be a decent thing which might be
worth pursuing. As you properly noted "stable pages during writeback"
requirement is one obstacle which won't be that trivial to avoid though...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-11-01 23:03                   ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-01 23:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Mon 31-10-11 16:14:47, Andy Lutomirski wrote:
> On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
> >> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
> >> >>  - Why are we calling file_update_time at all?  Presumably we also
> >> >> update the time when the page is written back (if not, that sounds
> >> >> like a bug, since the contents may be changed after something saw the
> >> >> mtime update), and, if so, why bother updating it on the first write?
> >> >> Anything that relies on this behavior is, I think, unreliable, because
> >> >> the page could be made writable arbitrarily early by another program
> >> >> that changes nothing.
> >> >  We don't update timestamp when the page is written back. I believe this
> >> > is mostly because we don't know whether the data has been changed by a
> >> > write syscall, which already updated the timestamp, or by mmap. That is
> >> > also the reason why we update the timestamp at page fault time.
> >> >
> >> >  The reason why file_update_time() blocks for you is probably that it
> >> > needs to get access to buffer where inode is stored on disk and because a
> >> > transaction including this buffer is committing at the moment, your thread
> >> > has to wait until the transaction commit finishes. This is mostly a problem
> >> > specific to how ext4 works so e.g. xfs shouldn't have it.
> >> >
> >> >  Generally I believe the attempts to achieve any RT-like latencies when
> >> > writing to a filesystem are rather hopeless. How much hopeless depends on
> >> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
> >> > can imagine some tweaks could reduce your latencies to an acceptable level
> >> > but once the disk gets loaded you'll be screwed). So I'd suggest that
> >> > having RT thread just store log in memory (or write to a pipe) and have
> >> > another non-RT thread write the data to disk would be a much more robust
> >> > design.
> >>
> >> Windows seems to do pretty well at this, and I think it should be fixable on
> >> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
> >> page_mkclean_one.  The fallout from that might be unpleasant, though, but
> >> it would probably speed up a number of workloads.
> >  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
> > unlikely to go away in a forseeable future. The reason is that we need to
> > reliably account the number of dirty pages so that we can throttle
> > processes that dirty too much of memory and also protect agaist system
> > going into out-of-memory problems when too many pages would be dirty (and
> > thus hard to reclaim). Thus we create clean pages as write-protected, when
> > they are first written to, we account them as dirtied and unprotect them.
> > When pages are cleaned by writeback, we decrement number of dirty pages
> > accordingly and write-protect them again.
> 
> What about skipping pte_wrprotect for mlocked pages and continuing to
> account them dirty even if they're actually clean?  This should be a
> straightforward patch except for the effect on stable pages for
> writeback.  (It would also have unfortunate side effects on
> ctime/mtime without my other patch to rearrange that code.)
  Well, doing proper dirty accounting would be a mess (you'd have to
unaccount dirty pages during munlock etc.) and I'm not sure what all would
break when page writes would not be coupled with page faults. So I don't
think it's really worth it.

Avoiding IO during a minor fault would be a decent thing which might be
worth pursuing. As you properly noted "stable pages during writeback"
requirement is one obstacle which won't be that trivial to avoid though...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-11-01 23:03                   ` Jan Kara
  (?)
@ 2011-11-01 23:10                     ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-01 23:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:
> On Mon 31-10-11 16:14:47, Andy Lutomirski wrote:
>> On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
>> >> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
>> >> >>  - Why are we calling file_update_time at all?  Presumably we also
>> >> >> update the time when the page is written back (if not, that sounds
>> >> >> like a bug, since the contents may be changed after something saw the
>> >> >> mtime update), and, if so, why bother updating it on the first write?
>> >> >> Anything that relies on this behavior is, I think, unreliable, because
>> >> >> the page could be made writable arbitrarily early by another program
>> >> >> that changes nothing.
>> >> >  We don't update timestamp when the page is written back. I believe this
>> >> > is mostly because we don't know whether the data has been changed by a
>> >> > write syscall, which already updated the timestamp, or by mmap. That is
>> >> > also the reason why we update the timestamp at page fault time.
>> >> >
>> >> >  The reason why file_update_time() blocks for you is probably that it
>> >> > needs to get access to buffer where inode is stored on disk and because a
>> >> > transaction including this buffer is committing at the moment, your thread
>> >> > has to wait until the transaction commit finishes. This is mostly a problem
>> >> > specific to how ext4 works so e.g. xfs shouldn't have it.
>> >> >
>> >> >  Generally I believe the attempts to achieve any RT-like latencies when
>> >> > writing to a filesystem are rather hopeless. How much hopeless depends on
>> >> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
>> >> > can imagine some tweaks could reduce your latencies to an acceptable level
>> >> > but once the disk gets loaded you'll be screwed). So I'd suggest that
>> >> > having RT thread just store log in memory (or write to a pipe) and have
>> >> > another non-RT thread write the data to disk would be a much more robust
>> >> > design.
>> >>
>> >> Windows seems to do pretty well at this, and I think it should be fixable on
>> >> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
>> >> page_mkclean_one.  The fallout from that might be unpleasant, though, but
>> >> it would probably speed up a number of workloads.
>> >  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
>> > unlikely to go away in a forseeable future. The reason is that we need to
>> > reliably account the number of dirty pages so that we can throttle
>> > processes that dirty too much of memory and also protect agaist system
>> > going into out-of-memory problems when too many pages would be dirty (and
>> > thus hard to reclaim). Thus we create clean pages as write-protected, when
>> > they are first written to, we account them as dirtied and unprotect them.
>> > When pages are cleaned by writeback, we decrement number of dirty pages
>> > accordingly and write-protect them again.
>>
>> What about skipping pte_wrprotect for mlocked pages and continuing to
>> account them dirty even if they're actually clean?  This should be a
>> straightforward patch except for the effect on stable pages for
>> writeback.  (It would also have unfortunate side effects on
>> ctime/mtime without my other patch to rearrange that code.)
>  Well, doing proper dirty accounting would be a mess (you'd have to
> unaccount dirty pages during munlock etc.) and I'm not sure what all would
> break when page writes would not be coupled with page faults. So I don't
> think it's really worth it.

I'll add it to my back burner.  I haven't figured out all (any?) of
the accounting yet.

>
> Avoiding IO during a minor fault would be a decent thing which might be
> worth pursuing. As you properly noted "stable pages during writeback"
> requirement is one obstacle which won't be that trivial to avoid though...

There's an easy solution that would be good enough for me: add a mount
option to turn off stable pages.

Is the other problem just a race, perhaps?  __block_page_mkwrite calls
__block_write_begin (which calls get_block, which I think is where the
latency comes from) *before* wait_on_page_writeback, which means that
there might not be any space allocated yet.

--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-11-01 23:10                     ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-01 23:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:
> On Mon 31-10-11 16:14:47, Andy Lutomirski wrote:
>> On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
>> >> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
>> >> >>  - Why are we calling file_update_time at all?  Presumably we also
>> >> >> update the time when the page is written back (if not, that sounds
>> >> >> like a bug, since the contents may be changed after something saw the
>> >> >> mtime update), and, if so, why bother updating it on the first write?
>> >> >> Anything that relies on this behavior is, I think, unreliable, because
>> >> >> the page could be made writable arbitrarily early by another program
>> >> >> that changes nothing.
>> >> >  We don't update timestamp when the page is written back. I believe this
>> >> > is mostly because we don't know whether the data has been changed by a
>> >> > write syscall, which already updated the timestamp, or by mmap. That is
>> >> > also the reason why we update the timestamp at page fault time.
>> >> >
>> >> >  The reason why file_update_time() blocks for you is probably that it
>> >> > needs to get access to buffer where inode is stored on disk and because a
>> >> > transaction including this buffer is committing at the moment, your thread
>> >> > has to wait until the transaction commit finishes. This is mostly a problem
>> >> > specific to how ext4 works so e.g. xfs shouldn't have it.
>> >> >
>> >> >  Generally I believe the attempts to achieve any RT-like latencies when
>> >> > writing to a filesystem are rather hopeless. How much hopeless depends on
>> >> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
>> >> > can imagine some tweaks could reduce your latencies to an acceptable level
>> >> > but once the disk gets loaded you'll be screwed). So I'd suggest that
>> >> > having RT thread just store log in memory (or write to a pipe) and have
>> >> > another non-RT thread write the data to disk would be a much more robust
>> >> > design.
>> >>
>> >> Windows seems to do pretty well at this, and I think it should be fixable on
>> >> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
>> >> page_mkclean_one.  The fallout from that might be unpleasant, though, but
>> >> it would probably speed up a number of workloads.
>> >  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
>> > unlikely to go away in a forseeable future. The reason is that we need to
>> > reliably account the number of dirty pages so that we can throttle
>> > processes that dirty too much of memory and also protect agaist system
>> > going into out-of-memory problems when too many pages would be dirty (and
>> > thus hard to reclaim). Thus we create clean pages as write-protected, when
>> > they are first written to, we account them as dirtied and unprotect them.
>> > When pages are cleaned by writeback, we decrement number of dirty pages
>> > accordingly and write-protect them again.
>>
>> What about skipping pte_wrprotect for mlocked pages and continuing to
>> account them dirty even if they're actually clean?  This should be a
>> straightforward patch except for the effect on stable pages for
>> writeback.  (It would also have unfortunate side effects on
>> ctime/mtime without my other patch to rearrange that code.)
>  Well, doing proper dirty accounting would be a mess (you'd have to
> unaccount dirty pages during munlock etc.) and I'm not sure what all would
> break when page writes would not be coupled with page faults. So I don't
> think it's really worth it.

I'll add it to my back burner.  I haven't figured out all (any?) of
the accounting yet.

>
> Avoiding IO during a minor fault would be a decent thing which might be
> worth pursuing. As you properly noted "stable pages during writeback"
> requirement is one obstacle which won't be that trivial to avoid though...

There's an easy solution that would be good enough for me: add a mount
option to turn off stable pages.

Is the other problem just a race, perhaps?  __block_page_mkwrite calls
__block_write_begin (which calls get_block, which I think is where the
latency comes from) *before* wait_on_page_writeback, which means that
there might not be any space allocated yet.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-11-01 23:10                     ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-01 23:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:
> On Mon 31-10-11 16:14:47, Andy Lutomirski wrote:
>> On Mon, Oct 31, 2011 at 4:10 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Fri 28-10-11 16:37:03, Andy Lutomirski wrote:
>> >> On Tue, Oct 25, 2011 at 5:26 AM, Jan Kara <jack@suse.cz> wrote:
>> >> >>  - Why are we calling file_update_time at all?  Presumably we also
>> >> >> update the time when the page is written back (if not, that sounds
>> >> >> like a bug, since the contents may be changed after something saw the
>> >> >> mtime update), and, if so, why bother updating it on the first write?
>> >> >> Anything that relies on this behavior is, I think, unreliable, because
>> >> >> the page could be made writable arbitrarily early by another program
>> >> >> that changes nothing.
>> >> >  We don't update timestamp when the page is written back. I believe this
>> >> > is mostly because we don't know whether the data has been changed by a
>> >> > write syscall, which already updated the timestamp, or by mmap. That is
>> >> > also the reason why we update the timestamp at page fault time.
>> >> >
>> >> >  The reason why file_update_time() blocks for you is probably that it
>> >> > needs to get access to buffer where inode is stored on disk and because a
>> >> > transaction including this buffer is committing at the moment, your thread
>> >> > has to wait until the transaction commit finishes. This is mostly a problem
>> >> > specific to how ext4 works so e.g. xfs shouldn't have it.
>> >> >
>> >> >  Generally I believe the attempts to achieve any RT-like latencies when
>> >> > writing to a filesystem are rather hopeless. How much hopeless depends on
>> >> > the load of the filesystem (e.g., in your case of mostly idle filesystem I
>> >> > can imagine some tweaks could reduce your latencies to an acceptable level
>> >> > but once the disk gets loaded you'll be screwed). So I'd suggest that
>> >> > having RT thread just store log in memory (or write to a pipe) and have
>> >> > another non-RT thread write the data to disk would be a much more robust
>> >> > design.
>> >>
>> >> Windows seems to do pretty well at this, and I think it should be fixable on
>> >> Linux too.  "All" that needs to be done is to remove the pte_wrprotect from
>> >> page_mkclean_one.  The fallout from that might be unpleasant, though, but
>> >> it would probably speed up a number of workloads.
>> >  Well, but Linux's mm pretty much depends the pte_wrprotect() so that's
>> > unlikely to go away in a forseeable future. The reason is that we need to
>> > reliably account the number of dirty pages so that we can throttle
>> > processes that dirty too much of memory and also protect agaist system
>> > going into out-of-memory problems when too many pages would be dirty (and
>> > thus hard to reclaim). Thus we create clean pages as write-protected, when
>> > they are first written to, we account them as dirtied and unprotect them.
>> > When pages are cleaned by writeback, we decrement number of dirty pages
>> > accordingly and write-protect them again.
>>
>> What about skipping pte_wrprotect for mlocked pages and continuing to
>> account them dirty even if they're actually clean?  This should be a
>> straightforward patch except for the effect on stable pages for
>> writeback.  (It would also have unfortunate side effects on
>> ctime/mtime without my other patch to rearrange that code.)
>  Well, doing proper dirty accounting would be a mess (you'd have to
> unaccount dirty pages during munlock etc.) and I'm not sure what all would
> break when page writes would not be coupled with page faults. So I don't
> think it's really worth it.

I'll add it to my back burner.  I haven't figured out all (any?) of
the accounting yet.

>
> Avoiding IO during a minor fault would be a decent thing which might be
> worth pursuing. As you properly noted "stable pages during writeback"
> requirement is one obstacle which won't be that trivial to avoid though...

There's an easy solution that would be good enough for me: add a mount
option to turn off stable pages.

Is the other problem just a race, perhaps?  __block_page_mkwrite calls
__block_write_begin (which calls get_block, which I think is where the
latency comes from) *before* wait_on_page_writeback, which means that
there might not be any space allocated yet.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-11-01 23:10                     ` Andy Lutomirski
  (?)
@ 2011-11-02  1:51                       ` Andy Lutomirski
  -1 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-02  1:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue, Nov 1, 2011 at 4:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:

>>
>> Avoiding IO during a minor fault would be a decent thing which might be
>> worth pursuing. As you properly noted "stable pages during writeback"
>> requirement is one obstacle which won't be that trivial to avoid though...
>
> There's an easy solution that would be good enough for me: add a mount
> option to turn off stable pages.
>
> Is the other problem just a race, perhaps?  __block_page_mkwrite calls
> __block_write_begin (which calls get_block, which I think is where the
> latency comes from) *before* wait_on_page_writeback, which means that
> there might not be any space allocated yet.

I think I'm right (other than calling it a race).  If I change my code to do:

- map the file (with MCL_FUTURE set)
- fallocate
- dirty all pages
- fsync
- dirty all pages again

in the non-real-time thread, then a short test that was a mediocre
reproducer seems to work.

This is annoying, though -- I'm not generating twice as much write I/O
as I used to.  Is there any way to force the delalloc code to do its
thing without triggering writeback?  I don't think fallocate has this
effect.

--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-11-02  1:51                       ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-02  1:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue, Nov 1, 2011 at 4:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:

>>
>> Avoiding IO during a minor fault would be a decent thing which might be
>> worth pursuing. As you properly noted "stable pages during writeback"
>> requirement is one obstacle which won't be that trivial to avoid though...
>
> There's an easy solution that would be good enough for me: add a mount
> option to turn off stable pages.
>
> Is the other problem just a race, perhaps?  __block_page_mkwrite calls
> __block_write_begin (which calls get_block, which I think is where the
> latency comes from) *before* wait_on_page_writeback, which means that
> there might not be any space allocated yet.

I think I'm right (other than calling it a race).  If I change my code to do:

- map the file (with MCL_FUTURE set)
- fallocate
- dirty all pages
- fsync
- dirty all pages again

in the non-real-time thread, then a short test that was a mediocre
reproducer seems to work.

This is annoying, though -- I'm not generating twice as much write I/O
as I used to.  Is there any way to force the delalloc code to do its
thing without triggering writeback?  I don't think fallocate has this
effect.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-11-02  1:51                       ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2011-11-02  1:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue, Nov 1, 2011 at 4:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:

>>
>> Avoiding IO during a minor fault would be a decent thing which might be
>> worth pursuing. As you properly noted "stable pages during writeback"
>> requirement is one obstacle which won't be that trivial to avoid though...
>
> There's an easy solution that would be good enough for me: add a mount
> option to turn off stable pages.
>
> Is the other problem just a race, perhaps?  __block_page_mkwrite calls
> __block_write_begin (which calls get_block, which I think is where the
> latency comes from) *before* wait_on_page_writeback, which means that
> there might not be any space allocated yet.

I think I'm right (other than calling it a race).  If I change my code to do:

- map the file (with MCL_FUTURE set)
- fallocate
- dirty all pages
- fsync
- dirty all pages again

in the non-real-time thread, then a short test that was a mediocre
reproducer seems to work.

This is annoying, though -- I'm not generating twice as much write I/O
as I used to.  Is there any way to force the delalloc code to do its
thing without triggering writeback?  I don't think fallocate has this
effect.

--Andy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
  2011-11-01 23:02                   ` Andy Lutomirski
@ 2011-11-02  7:38                     ` Christoph Hellwig
  -1 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2011-11-02  7:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4

On Tue, Nov 01, 2011 at 04:02:24PM -0700, Andy Lutomirski wrote:
> Hmm.  Isn't it permitted to at least read from an fs while holding the
> page lock?  I thought that the page lock was held for the entire
> duration of a read and at the beginning of writeback.
> 
> I can push this down to the ->writepage implementations or to the
> clear_page_dirty_for_io callers, but that will result in a bigger
> patch.

Besides the current way that seems to be the only reasonable place to
do it.  Pushing it into ->writepage also has the benefit that
filesystems could piggy back the ctime update onto the transaction that
updates the extent tree.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
@ 2011-11-02  7:38                     ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2011-11-02  7:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4

On Tue, Nov 01, 2011 at 04:02:24PM -0700, Andy Lutomirski wrote:
> Hmm.  Isn't it permitted to at least read from an fs while holding the
> page lock?  I thought that the page lock was held for the entire
> duration of a read and at the beginning of writeback.
> 
> I can push this down to the ->writepage implementations or to the
> clear_page_dirty_for_io callers, but that will result in a bigger
> patch.

Besides the current way that seems to be the only reasonable place to
do it.  Pushing it into ->writepage also has the benefit that
filesystems could piggy back the ctime update onto the transaction that
updates the extent tree.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
  2011-11-01 23:02                   ` Andy Lutomirski
  (?)
@ 2011-11-02 15:02                     ` Jan Kara
  -1 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-02 15:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4

On Tue 01-11-11 16:02:24, Andy Lutomirski wrote:
> On Tue, Nov 1, 2011 at 3:53 PM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-10-11 16:39:25, Andy Lutomirski wrote:
> >> We used to update a file's time on do_wp_page.  This caused latency
> >> whenever file_update_time would sleep (this happens on ext4).  It is
> >> also, IMO, less than ideal: any copy, backup, or 'make' run taken
> >> after do_wp_page but before an mmap user finished writing would see
> >> the new timestamp but the old contents.
> >>
> >> With this patch, cmtime is updated after a page is written.  When the
> >> mm code transfers the dirty bit from a pte to the associated struct
> >> page, it also sets a new page flag indicating that the page was
> >> modified directly from userspace.  The inode's time is then updated in
> >> clear_page_dirty_for_io.
> >>
> >> We can't update cmtime in all contexts in which ptes are unmapped:
> >> various reclaim paths can unmap ptes from GFP_NOFS paths.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>
> >> I'm not thrilled about using a page flag for this, but I haven't
> >> spotted a better way.  Updating the time in writepage would touch
> >> a lot of files and would interact oddly with write.
> >  I see two problems with this patch:
> > 1) Using a page flags is really a no-go. We are rather short on page flags
> > so using them for such minor thing is a real wastage. Moreover it should be
> > rather easy to just use an inode flag instead.
> 
> Am I allowed to set inode flags without holding any locks?
  That's a good question. Locking of i_flags was always kind of unclear to
me. They are certainly read without any locks and in the couple of places
where setting can actually race VFS uses i_mutex for serialization which is
kind of overkill (and unusable from page fault due to locking constraints).
Probably using atomic bitops for setting i_flags would be needed.

> > 2) You cannot call inode_update_time_writable() from
> > clear_page_dirty_for_io() because that is called under a page lock and thus
> > would create lock inversion problems.
> 
> Hmm.  Isn't it permitted to at least read from an fs while holding the
> page lock?  I thought that the page lock was held for the entire
> duration of a read and at the beginning of writeback.
  You are right that page lock is held during the whole ->readpage() call.
But that does not mean any reading can be done while page lock is held...
Page lock is also held during the ->writepage() call but that is one of
reasons why several filesystems ignore that callback and use ->writepages()
callback which allows them to do some fs internal locking before taking the
page lock.

> I can push this down to the ->writepage implementations or to the
> clear_page_dirty_for_io callers, but that will result in a bigger
> patch.
  Yes, I think this might be a way to go. Actually using
block_write_full_page() callback for updating times might somewhat reduce
number of filesystems that need to be modified...

								Honza

> >>  fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
> >>  include/linux/fs.h         |    1 +
> >>  include/linux/page-flags.h |    5 ++++
> >>  mm/page-writeback.c        |   19 +++++++++++++---
> >>  mm/rmap.c                  |   18 +++++++++++++-
> >>  5 files changed, 72 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index ec79246..ee93a25 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> >> @@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
> >>  }
> >>  EXPORT_SYMBOL(touch_atime);
> >>
> >> -/**
> >> - *   file_update_time        -       update mtime and ctime time
> >> - *   @file: file accessed
> >> - *
> >> - *   Update the mtime and ctime members of an inode and mark the inode
> >> - *   for writeback.  Note that this function is meant exclusively for
> >> - *   usage in the file write path of filesystems, and filesystems may
> >> - *   choose to explicitly ignore update via this function with the
> >> - *   S_NOCMTIME inode flag, e.g. for network filesystem where these
> >> - *   timestamps are handled by the server.
> >> - */
> >> -
> >> -void file_update_time(struct file *file)
> >> +static void do_inode_update_time(struct file *file, struct inode *inode)
> >>  {
> >> -     struct inode *inode = file->f_path.dentry->d_inode;
> >>       struct timespec now;
> >>       enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
> >>
> >> @@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
> >>               return;
> >>
> >>       /* Finally allowed to write? Takes lock. */
> >> -     if (mnt_want_write_file(file))
> >> +     if (file && mnt_want_write_file(file))
> >>               return;
> >>
> >>       /* Only change inode inside the lock region */
> >> @@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
> >>       if (sync_it & S_MTIME)
> >>               inode->i_mtime = now;
> >>       mark_inode_dirty_sync(inode);
> >> -     mnt_drop_write(file->f_path.mnt);
> >> +
> >> +     if (file)
> >> +             mnt_drop_write(file->f_path.mnt);
> >> +}
> >> +
> >> +/**
> >> + *   file_update_time        -       update mtime and ctime time
> >> + *   @file: file accessed
> >> + *
> >> + *   Update the mtime and ctime members of an inode and mark the inode
> >> + *   for writeback.  Note that this function is meant exclusively for
> >> + *   usage in the file write path of filesystems, and filesystems may
> >> + *   choose to explicitly ignore update via this function with the
> >> + *   S_NOCMTIME inode flag, e.g. for network filesystem where these
> >> + *   timestamps are handled by the server.
> >> + */
> >> +
> >> +void file_update_time(struct file *file)
> >> +{
> >> +     do_inode_update_time(file, file->f_path.dentry->d_inode);
> >>  }
> >>  EXPORT_SYMBOL(file_update_time);
> >>
> >> +/**
> >> + *   inode_update_time_writable      -       update mtime and ctime
> >> + *   @inode: inode accessed
> >> + *
> >> + *   Same as file_update_time, except that the caller is responsible
> >> + *   for checking that the mount is writable.
> >> + */
> >> +
> >> +void inode_update_time_writable(struct inode *inode)
> >> +{
> >> +     do_inode_update_time(0, inode);
> >> +}
> >> +
> >>  int inode_needs_sync(struct inode *inode)
> >>  {
> >>       if (IS_SYNC(inode))
> >> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >> index 277f497..9e28927 100644
> >> --- a/include/linux/fs.h
> >> +++ b/include/linux/fs.h
> >> @@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
> >>  extern void setattr_copy(struct inode *inode, const struct iattr *attr);
> >>
> >>  extern void file_update_time(struct file *file);
> >> +extern void inode_update_time_writable(struct inode *inode);
> >>
> >>  extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
> >>  extern void save_mount_options(struct super_block *sb, char *options);
> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> index e90a673..4eed012 100644
> >> --- a/include/linux/page-flags.h
> >> +++ b/include/linux/page-flags.h
> >> @@ -107,6 +107,7 @@ enum pageflags {
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>       PG_compound_lock,
> >>  #endif
> >> +     PG_update_cmtime,       /* Dirtied via writable mapping. */
> >>       __NR_PAGEFLAGS,
> >>
> >>       /* Filesystems */
> >> @@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
> >>  #define __PG_HWPOISON 0
> >>  #endif
> >>
> >> +/* Whoever clears PG_update_cmtime must update the cmtime. */
> >> +SETPAGEFLAG(UpdateCMTime, update_cmtime)
> >> +TESTCLEARFLAG(UpdateCMTime, update_cmtime)
> >> +
> >>  u64 stable_page_flags(struct page *page);
> >>
> >>  static inline int PageUptodate(struct page *page)
> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> index 0e309cd..41c48ea 100644
> >> --- a/mm/page-writeback.c
> >> +++ b/mm/page-writeback.c
> >> @@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> >>
> >>  /*
> >>   * Clear a page's dirty flag, while caring for dirty memory accounting.
> >> - * Returns true if the page was previously dirty.
> >> + * Returns true if the page was previously dirty.  Also updates inode time
> >> + * if necessary.
> >>   *
> >>   * This is for preparing to put the page under writeout.  We leave the page
> >>   * tagged as dirty in the radix tree so that a concurrent write-for-sync
> >> @@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> >>   */
> >>  int clear_page_dirty_for_io(struct page *page)
> >>  {
> >> +     int ret;
> >>       struct address_space *mapping = page_mapping(page);
> >>
> >>       BUG_ON(!PageLocked(page));
> >> @@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
> >>                       dec_zone_page_state(page, NR_FILE_DIRTY);
> >>                       dec_bdi_stat(mapping->backing_dev_info,
> >>                                       BDI_RECLAIMABLE);
> >> -                     return 1;
> >> +                     ret = 1;
> >> +                     goto out;
> >>               }
> >> -             return 0;
> >> +             ret = 0;
> >> +             goto out;
> >>       }
> >> -     return TestClearPageDirty(page);
> >> +     ret = TestClearPageDirty(page);
> >> +
> >> +out:
> >> +     /* We know that the inode (if any) is on a writable mount. */
> >> +     if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
> >> +             inode_update_time_writable(mapping->host);
> >> +
> >> +     return ret;
> >>  }
> >>  EXPORT_SYMBOL(clear_page_dirty_for_io);
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 8005080..2ee595d 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
> >>               struct address_space *mapping = page_mapping(page);
> >>               if (mapping) {
> >>                       ret = page_mkclean_file(mapping, page);
> >> +
> >> +                     /*
> >> +                      * If dirtied via shared writable mapping, cmtime
> >> +                      * needs to be updated.  If dirtied only through
> >> +                      * write(), etc, then the writer already updated
> >> +                      * cmtime.
> >> +                      */
> >> +                     if (ret)
> >> +                             SetPageUpdateCMTime(page);
> >> +
> >>                       if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> >>                               ret = 1;
> >>               }
> >> @@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>       pteval = ptep_clear_flush_notify(vma, address, pte);
> >>
> >>       /* Move the dirty bit to the physical page now the pte is gone. */
> >> -     if (pte_dirty(pteval))
> >> +     if (pte_dirty(pteval)) {
> >> +             SetPageUpdateCMTime(page);
> >>               set_page_dirty(page);
> >> +     }
> >>
> >>       /* Update high watermark before we lower rss */
> >>       update_hiwater_rss(mm);
> >> @@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >>                       set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
> >>
> >>               /* Move the dirty bit to the physical page now the pte is gone. */
> >> -             if (pte_dirty(pteval))
> >> +             if (pte_dirty(pteval)) {
> >> +                     SetPageUpdateCMTime(page);
> >>                       set_page_dirty(page);
> >> +             }
> >>
> >>               page_remove_rmap(page);
> >>               page_cache_release(page);
> >> --
> >> 1.7.6.4
> >>
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
@ 2011-11-02 15:02                     ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-02 15:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4

On Tue 01-11-11 16:02:24, Andy Lutomirski wrote:
> On Tue, Nov 1, 2011 at 3:53 PM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-10-11 16:39:25, Andy Lutomirski wrote:
> >> We used to update a file's time on do_wp_page.  This caused latency
> >> whenever file_update_time would sleep (this happens on ext4).  It is
> >> also, IMO, less than ideal: any copy, backup, or 'make' run taken
> >> after do_wp_page but before an mmap user finished writing would see
> >> the new timestamp but the old contents.
> >>
> >> With this patch, cmtime is updated after a page is written.  When the
> >> mm code transfers the dirty bit from a pte to the associated struct
> >> page, it also sets a new page flag indicating that the page was
> >> modified directly from userspace.  The inode's time is then updated in
> >> clear_page_dirty_for_io.
> >>
> >> We can't update cmtime in all contexts in which ptes are unmapped:
> >> various reclaim paths can unmap ptes from GFP_NOFS paths.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>
> >> I'm not thrilled about using a page flag for this, but I haven't
> >> spotted a better way.  Updating the time in writepage would touch
> >> a lot of files and would interact oddly with write.
> >  I see two problems with this patch:
> > 1) Using a page flags is really a no-go. We are rather short on page flags
> > so using them for such minor thing is a real wastage. Moreover it should be
> > rather easy to just use an inode flag instead.
> 
> Am I allowed to set inode flags without holding any locks?
  That's a good question. Locking of i_flags was always kind of unclear to
me. They are certainly read without any locks and in the couple of places
where setting can actually race VFS uses i_mutex for serialization which is
kind of overkill (and unusable from page fault due to locking constraints).
Probably using atomic bitops for setting i_flags would be needed.

> > 2) You cannot call inode_update_time_writable() from
> > clear_page_dirty_for_io() because that is called under a page lock and thus
> > would create lock inversion problems.
> 
> Hmm.  Isn't it permitted to at least read from an fs while holding the
> page lock?  I thought that the page lock was held for the entire
> duration of a read and at the beginning of writeback.
  You are right that page lock is held during the whole ->readpage() call.
But that does not mean any reading can be done while page lock is held...
Page lock is also held during the ->writepage() call but that is one of
reasons why several filesystems ignore that callback and use ->writepages()
callback which allows them to do some fs internal locking before taking the
page lock.

> I can push this down to the ->writepage implementations or to the
> clear_page_dirty_for_io callers, but that will result in a bigger
> patch.
  Yes, I think this might be a way to go. Actually using
block_write_full_page() callback for updating times might somewhat reduce
number of filesystems that need to be modified...

								Honza

> >>  fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
> >>  include/linux/fs.h         |    1 +
> >>  include/linux/page-flags.h |    5 ++++
> >>  mm/page-writeback.c        |   19 +++++++++++++---
> >>  mm/rmap.c                  |   18 +++++++++++++-
> >>  5 files changed, 72 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index ec79246..ee93a25 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> >> @@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
> >>  }
> >>  EXPORT_SYMBOL(touch_atime);
> >>
> >> -/**
> >> - *   file_update_time        -       update mtime and ctime time
> >> - *   @file: file accessed
> >> - *
> >> - *   Update the mtime and ctime members of an inode and mark the inode
> >> - *   for writeback.  Note that this function is meant exclusively for
> >> - *   usage in the file write path of filesystems, and filesystems may
> >> - *   choose to explicitly ignore update via this function with the
> >> - *   S_NOCMTIME inode flag, e.g. for network filesystem where these
> >> - *   timestamps are handled by the server.
> >> - */
> >> -
> >> -void file_update_time(struct file *file)
> >> +static void do_inode_update_time(struct file *file, struct inode *inode)
> >>  {
> >> -     struct inode *inode = file->f_path.dentry->d_inode;
> >>       struct timespec now;
> >>       enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
> >>
> >> @@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
> >>               return;
> >>
> >>       /* Finally allowed to write? Takes lock. */
> >> -     if (mnt_want_write_file(file))
> >> +     if (file && mnt_want_write_file(file))
> >>               return;
> >>
> >>       /* Only change inode inside the lock region */
> >> @@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
> >>       if (sync_it & S_MTIME)
> >>               inode->i_mtime = now;
> >>       mark_inode_dirty_sync(inode);
> >> -     mnt_drop_write(file->f_path.mnt);
> >> +
> >> +     if (file)
> >> +             mnt_drop_write(file->f_path.mnt);
> >> +}
> >> +
> >> +/**
> >> + *   file_update_time        -       update mtime and ctime time
> >> + *   @file: file accessed
> >> + *
> >> + *   Update the mtime and ctime members of an inode and mark the inode
> >> + *   for writeback.  Note that this function is meant exclusively for
> >> + *   usage in the file write path of filesystems, and filesystems may
> >> + *   choose to explicitly ignore update via this function with the
> >> + *   S_NOCMTIME inode flag, e.g. for network filesystem where these
> >> + *   timestamps are handled by the server.
> >> + */
> >> +
> >> +void file_update_time(struct file *file)
> >> +{
> >> +     do_inode_update_time(file, file->f_path.dentry->d_inode);
> >>  }
> >>  EXPORT_SYMBOL(file_update_time);
> >>
> >> +/**
> >> + *   inode_update_time_writable      -       update mtime and ctime
> >> + *   @inode: inode accessed
> >> + *
> >> + *   Same as file_update_time, except that the caller is responsible
> >> + *   for checking that the mount is writable.
> >> + */
> >> +
> >> +void inode_update_time_writable(struct inode *inode)
> >> +{
> >> +     do_inode_update_time(0, inode);
> >> +}
> >> +
> >>  int inode_needs_sync(struct inode *inode)
> >>  {
> >>       if (IS_SYNC(inode))
> >> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >> index 277f497..9e28927 100644
> >> --- a/include/linux/fs.h
> >> +++ b/include/linux/fs.h
> >> @@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
> >>  extern void setattr_copy(struct inode *inode, const struct iattr *attr);
> >>
> >>  extern void file_update_time(struct file *file);
> >> +extern void inode_update_time_writable(struct inode *inode);
> >>
> >>  extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
> >>  extern void save_mount_options(struct super_block *sb, char *options);
> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> index e90a673..4eed012 100644
> >> --- a/include/linux/page-flags.h
> >> +++ b/include/linux/page-flags.h
> >> @@ -107,6 +107,7 @@ enum pageflags {
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>       PG_compound_lock,
> >>  #endif
> >> +     PG_update_cmtime,       /* Dirtied via writable mapping. */
> >>       __NR_PAGEFLAGS,
> >>
> >>       /* Filesystems */
> >> @@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
> >>  #define __PG_HWPOISON 0
> >>  #endif
> >>
> >> +/* Whoever clears PG_update_cmtime must update the cmtime. */
> >> +SETPAGEFLAG(UpdateCMTime, update_cmtime)
> >> +TESTCLEARFLAG(UpdateCMTime, update_cmtime)
> >> +
> >>  u64 stable_page_flags(struct page *page);
> >>
> >>  static inline int PageUptodate(struct page *page)
> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> index 0e309cd..41c48ea 100644
> >> --- a/mm/page-writeback.c
> >> +++ b/mm/page-writeback.c
> >> @@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> >>
> >>  /*
> >>   * Clear a page's dirty flag, while caring for dirty memory accounting.
> >> - * Returns true if the page was previously dirty.
> >> + * Returns true if the page was previously dirty.  Also updates inode time
> >> + * if necessary.
> >>   *
> >>   * This is for preparing to put the page under writeout.  We leave the page
> >>   * tagged as dirty in the radix tree so that a concurrent write-for-sync
> >> @@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> >>   */
> >>  int clear_page_dirty_for_io(struct page *page)
> >>  {
> >> +     int ret;
> >>       struct address_space *mapping = page_mapping(page);
> >>
> >>       BUG_ON(!PageLocked(page));
> >> @@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
> >>                       dec_zone_page_state(page, NR_FILE_DIRTY);
> >>                       dec_bdi_stat(mapping->backing_dev_info,
> >>                                       BDI_RECLAIMABLE);
> >> -                     return 1;
> >> +                     ret = 1;
> >> +                     goto out;
> >>               }
> >> -             return 0;
> >> +             ret = 0;
> >> +             goto out;
> >>       }
> >> -     return TestClearPageDirty(page);
> >> +     ret = TestClearPageDirty(page);
> >> +
> >> +out:
> >> +     /* We know that the inode (if any) is on a writable mount. */
> >> +     if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
> >> +             inode_update_time_writable(mapping->host);
> >> +
> >> +     return ret;
> >>  }
> >>  EXPORT_SYMBOL(clear_page_dirty_for_io);
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 8005080..2ee595d 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
> >>               struct address_space *mapping = page_mapping(page);
> >>               if (mapping) {
> >>                       ret = page_mkclean_file(mapping, page);
> >> +
> >> +                     /*
> >> +                      * If dirtied via shared writable mapping, cmtime
> >> +                      * needs to be updated.  If dirtied only through
> >> +                      * write(), etc, then the writer already updated
> >> +                      * cmtime.
> >> +                      */
> >> +                     if (ret)
> >> +                             SetPageUpdateCMTime(page);
> >> +
> >>                       if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> >>                               ret = 1;
> >>               }
> >> @@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>       pteval = ptep_clear_flush_notify(vma, address, pte);
> >>
> >>       /* Move the dirty bit to the physical page now the pte is gone. */
> >> -     if (pte_dirty(pteval))
> >> +     if (pte_dirty(pteval)) {
> >> +             SetPageUpdateCMTime(page);
> >>               set_page_dirty(page);
> >> +     }
> >>
> >>       /* Update high watermark before we lower rss */
> >>       update_hiwater_rss(mm);
> >> @@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >>                       set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
> >>
> >>               /* Move the dirty bit to the physical page now the pte is gone. */
> >> -             if (pte_dirty(pteval))
> >> +             if (pte_dirty(pteval)) {
> >> +                     SetPageUpdateCMTime(page);
> >>                       set_page_dirty(page);
> >> +             }
> >>
> >>               page_remove_rmap(page);
> >>               page_cache_release(page);
> >> --
> >> 1.7.6.4
> >>
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
@ 2011-11-02 15:02                     ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-02 15:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-fsdevel,
	linux-ext4

On Tue 01-11-11 16:02:24, Andy Lutomirski wrote:
> On Tue, Nov 1, 2011 at 3:53 PM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-10-11 16:39:25, Andy Lutomirski wrote:
> >> We used to update a file's time on do_wp_page.  This caused latency
> >> whenever file_update_time would sleep (this happens on ext4).  It is
> >> also, IMO, less than ideal: any copy, backup, or 'make' run taken
> >> after do_wp_page but before an mmap user finished writing would see
> >> the new timestamp but the old contents.
> >>
> >> With this patch, cmtime is updated after a page is written.  When the
> >> mm code transfers the dirty bit from a pte to the associated struct
> >> page, it also sets a new page flag indicating that the page was
> >> modified directly from userspace.  The inode's time is then updated in
> >> clear_page_dirty_for_io.
> >>
> >> We can't update cmtime in all contexts in which ptes are unmapped:
> >> various reclaim paths can unmap ptes from GFP_NOFS paths.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>
> >> I'm not thrilled about using a page flag for this, but I haven't
> >> spotted a better way.  Updating the time in writepage would touch
> >> a lot of files and would interact oddly with write.
> >  I see two problems with this patch:
> > 1) Using a page flags is really a no-go. We are rather short on page flags
> > so using them for such minor thing is a real wastage. Moreover it should be
> > rather easy to just use an inode flag instead.
> 
> Am I allowed to set inode flags without holding any locks?
  That's a good question. Locking of i_flags was always kind of unclear to
me. They are certainly read without any locks and in the couple of places
where setting can actually race VFS uses i_mutex for serialization which is
kind of overkill (and unusable from page fault due to locking constraints).
Probably using atomic bitops for setting i_flags would be needed.

> > 2) You cannot call inode_update_time_writable() from
> > clear_page_dirty_for_io() because that is called under a page lock and thus
> > would create lock inversion problems.
> 
> Hmm.  Isn't it permitted to at least read from an fs while holding the
> page lock?  I thought that the page lock was held for the entire
> duration of a read and at the beginning of writeback.
  You are right that page lock is held during the whole ->readpage() call.
But that does not mean any reading can be done while page lock is held...
Page lock is also held during the ->writepage() call but that is one of
reasons why several filesystems ignore that callback and use ->writepages()
callback which allows them to do some fs internal locking before taking the
page lock.

> I can push this down to the ->writepage implementations or to the
> clear_page_dirty_for_io callers, but that will result in a bigger
> patch.
  Yes, I think this might be a way to go. Actually using
block_write_full_page() callback for updating times might somewhat reduce
number of filesystems that need to be modified...

								Honza

> >>  fs/inode.c                 |   51 ++++++++++++++++++++++++++++++-------------
> >>  include/linux/fs.h         |    1 +
> >>  include/linux/page-flags.h |    5 ++++
> >>  mm/page-writeback.c        |   19 +++++++++++++---
> >>  mm/rmap.c                  |   18 +++++++++++++-
> >>  5 files changed, 72 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index ec79246..ee93a25 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> >> @@ -1461,21 +1461,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
> >>  }
> >>  EXPORT_SYMBOL(touch_atime);
> >>
> >> -/**
> >> - *   file_update_time        -       update mtime and ctime time
> >> - *   @file: file accessed
> >> - *
> >> - *   Update the mtime and ctime members of an inode and mark the inode
> >> - *   for writeback.  Note that this function is meant exclusively for
> >> - *   usage in the file write path of filesystems, and filesystems may
> >> - *   choose to explicitly ignore update via this function with the
> >> - *   S_NOCMTIME inode flag, e.g. for network filesystem where these
> >> - *   timestamps are handled by the server.
> >> - */
> >> -
> >> -void file_update_time(struct file *file)
> >> +static void do_inode_update_time(struct file *file, struct inode *inode)
> >>  {
> >> -     struct inode *inode = file->f_path.dentry->d_inode;
> >>       struct timespec now;
> >>       enum { S_MTIME = 1, S_CTIME = 2, S_VERSION = 4 } sync_it = 0;
> >>
> >> @@ -1497,7 +1484,7 @@ void file_update_time(struct file *file)
> >>               return;
> >>
> >>       /* Finally allowed to write? Takes lock. */
> >> -     if (mnt_want_write_file(file))
> >> +     if (file && mnt_want_write_file(file))
> >>               return;
> >>
> >>       /* Only change inode inside the lock region */
> >> @@ -1508,10 +1495,42 @@ void file_update_time(struct file *file)
> >>       if (sync_it & S_MTIME)
> >>               inode->i_mtime = now;
> >>       mark_inode_dirty_sync(inode);
> >> -     mnt_drop_write(file->f_path.mnt);
> >> +
> >> +     if (file)
> >> +             mnt_drop_write(file->f_path.mnt);
> >> +}
> >> +
> >> +/**
> >> + *   file_update_time        -       update mtime and ctime time
> >> + *   @file: file accessed
> >> + *
> >> + *   Update the mtime and ctime members of an inode and mark the inode
> >> + *   for writeback.  Note that this function is meant exclusively for
> >> + *   usage in the file write path of filesystems, and filesystems may
> >> + *   choose to explicitly ignore update via this function with the
> >> + *   S_NOCMTIME inode flag, e.g. for network filesystem where these
> >> + *   timestamps are handled by the server.
> >> + */
> >> +
> >> +void file_update_time(struct file *file)
> >> +{
> >> +     do_inode_update_time(file, file->f_path.dentry->d_inode);
> >>  }
> >>  EXPORT_SYMBOL(file_update_time);
> >>
> >> +/**
> >> + *   inode_update_time_writable      -       update mtime and ctime
> >> + *   @inode: inode accessed
> >> + *
> >> + *   Same as file_update_time, except that the caller is responsible
> >> + *   for checking that the mount is writable.
> >> + */
> >> +
> >> +void inode_update_time_writable(struct inode *inode)
> >> +{
> >> +     do_inode_update_time(0, inode);
> >> +}
> >> +
> >>  int inode_needs_sync(struct inode *inode)
> >>  {
> >>       if (IS_SYNC(inode))
> >> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >> index 277f497..9e28927 100644
> >> --- a/include/linux/fs.h
> >> +++ b/include/linux/fs.h
> >> @@ -2553,6 +2553,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
> >>  extern void setattr_copy(struct inode *inode, const struct iattr *attr);
> >>
> >>  extern void file_update_time(struct file *file);
> >> +extern void inode_update_time_writable(struct inode *inode);
> >>
> >>  extern int generic_show_options(struct seq_file *m, struct vfsmount *mnt);
> >>  extern void save_mount_options(struct super_block *sb, char *options);
> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> index e90a673..4eed012 100644
> >> --- a/include/linux/page-flags.h
> >> +++ b/include/linux/page-flags.h
> >> @@ -107,6 +107,7 @@ enum pageflags {
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>       PG_compound_lock,
> >>  #endif
> >> +     PG_update_cmtime,       /* Dirtied via writable mapping. */
> >>       __NR_PAGEFLAGS,
> >>
> >>       /* Filesystems */
> >> @@ -273,6 +274,10 @@ PAGEFLAG_FALSE(HWPoison)
> >>  #define __PG_HWPOISON 0
> >>  #endif
> >>
> >> +/* Whoever clears PG_update_cmtime must update the cmtime. */
> >> +SETPAGEFLAG(UpdateCMTime, update_cmtime)
> >> +TESTCLEARFLAG(UpdateCMTime, update_cmtime)
> >> +
> >>  u64 stable_page_flags(struct page *page);
> >>
> >>  static inline int PageUptodate(struct page *page)
> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> index 0e309cd..41c48ea 100644
> >> --- a/mm/page-writeback.c
> >> +++ b/mm/page-writeback.c
> >> @@ -1460,7 +1460,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> >>
> >>  /*
> >>   * Clear a page's dirty flag, while caring for dirty memory accounting.
> >> - * Returns true if the page was previously dirty.
> >> + * Returns true if the page was previously dirty.  Also updates inode time
> >> + * if necessary.
> >>   *
> >>   * This is for preparing to put the page under writeout.  We leave the page
> >>   * tagged as dirty in the radix tree so that a concurrent write-for-sync
> >> @@ -1474,6 +1475,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> >>   */
> >>  int clear_page_dirty_for_io(struct page *page)
> >>  {
> >> +     int ret;
> >>       struct address_space *mapping = page_mapping(page);
> >>
> >>       BUG_ON(!PageLocked(page));
> >> @@ -1520,11 +1522,20 @@ int clear_page_dirty_for_io(struct page *page)
> >>                       dec_zone_page_state(page, NR_FILE_DIRTY);
> >>                       dec_bdi_stat(mapping->backing_dev_info,
> >>                                       BDI_RECLAIMABLE);
> >> -                     return 1;
> >> +                     ret = 1;
> >> +                     goto out;
> >>               }
> >> -             return 0;
> >> +             ret = 0;
> >> +             goto out;
> >>       }
> >> -     return TestClearPageDirty(page);
> >> +     ret = TestClearPageDirty(page);
> >> +
> >> +out:
> >> +     /* We know that the inode (if any) is on a writable mount. */
> >> +     if (mapping && mapping->host && TestClearPageUpdateCMTime(page))
> >> +             inode_update_time_writable(mapping->host);
> >> +
> >> +     return ret;
> >>  }
> >>  EXPORT_SYMBOL(clear_page_dirty_for_io);
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 8005080..2ee595d 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -937,6 +937,16 @@ int page_mkclean(struct page *page)
> >>               struct address_space *mapping = page_mapping(page);
> >>               if (mapping) {
> >>                       ret = page_mkclean_file(mapping, page);
> >> +
> >> +                     /*
> >> +                      * If dirtied via shared writable mapping, cmtime
> >> +                      * needs to be updated.  If dirtied only through
> >> +                      * write(), etc, then the writer already updated
> >> +                      * cmtime.
> >> +                      */
> >> +                     if (ret)
> >> +                             SetPageUpdateCMTime(page);
> >> +
> >>                       if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> >>                               ret = 1;
> >>               }
> >> @@ -1203,8 +1213,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>       pteval = ptep_clear_flush_notify(vma, address, pte);
> >>
> >>       /* Move the dirty bit to the physical page now the pte is gone. */
> >> -     if (pte_dirty(pteval))
> >> +     if (pte_dirty(pteval)) {
> >> +             SetPageUpdateCMTime(page);
> >>               set_page_dirty(page);
> >> +     }
> >>
> >>       /* Update high watermark before we lower rss */
> >>       update_hiwater_rss(mm);
> >> @@ -1388,8 +1400,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >>                       set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
> >>
> >>               /* Move the dirty bit to the physical page now the pte is gone. */
> >> -             if (pte_dirty(pteval))
> >> +             if (pte_dirty(pteval)) {
> >> +                     SetPageUpdateCMTime(page);
> >>                       set_page_dirty(page);
> >> +             }
> >>
> >>               page_remove_rmap(page);
> >>               page_cache_release(page);
> >> --
> >> 1.7.6.4
> >>
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
  2011-11-02 15:02                     ` Jan Kara
@ 2011-11-02 15:19                       ` Ted Ts'o
  -1 siblings, 0 replies; 49+ messages in thread
From: Ted Ts'o @ 2011-11-02 15:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andy Lutomirski, Andreas Dilger, linux-kernel, linux-mm,
	linux-fsdevel, linux-ext4

On Wed, Nov 02, 2011 at 04:02:00PM +0100, Jan Kara wrote:
>   That's a good question. Locking of i_flags was always kind of unclear to
> me. They are certainly read without any locks and in the couple of places
> where setting can actually race VFS uses i_mutex for serialization which is
> kind of overkill (and unusable from page fault due to locking constraints).
> Probably using atomic bitops for setting i_flags would be needed.

Adding a set of inode_{test,set,clear}_flag() inline functions, and
then converting accesses of i_flags to use them would be a great
cleanup task.  It's been on my mental todo list for a while, but it's
a pretty invasive change.  What we have right now is definitely racy,
though, and we only get away with it because i_flags changes happen
relatively rarely.  Fixing this would be definitely be a Good Thing.

						- Ted

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] mm: Improve cmtime update on shared writable mmaps
@ 2011-11-02 15:19                       ` Ted Ts'o
  0 siblings, 0 replies; 49+ messages in thread
From: Ted Ts'o @ 2011-11-02 15:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andy Lutomirski, Andreas Dilger, linux-kernel, linux-mm,
	linux-fsdevel, linux-ext4

On Wed, Nov 02, 2011 at 04:02:00PM +0100, Jan Kara wrote:
>   That's a good question. Locking of i_flags was always kind of unclear to
> me. They are certainly read without any locks and in the couple of places
> where setting can actually race VFS uses i_mutex for serialization which is
> kind of overkill (and unusable from page fault due to locking constraints).
> Probably using atomic bitops for setting i_flags would be needed.

Adding a set of inode_{test,set,clear}_flag() inline functions, and
then converting accesses of i_flags to use them would be a great
cleanup task.  It's been on my mental todo list for a while, but it's
a pretty invasive change.  What we have right now is definitely racy,
though, and we only get away with it because i_flags changes happen
relatively rarely.  Fixing this would be definitely be a Good Thing.

						- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
  2011-11-02  1:51                       ` Andy Lutomirski
  (?)
@ 2011-11-02 20:17                         ` Jan Kara
  -1 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-02 20:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue 01-11-11 18:51:04, Andy Lutomirski wrote:
> On Tue, Nov 1, 2011 at 4:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:
> >> Avoiding IO during a minor fault would be a decent thing which might be
> >> worth pursuing. As you properly noted "stable pages during writeback"
> >> requirement is one obstacle which won't be that trivial to avoid though...
> >
> > There's an easy solution that would be good enough for me: add a mount
> > option to turn off stable pages.
> >
> > Is the other problem just a race, perhaps?  __block_page_mkwrite calls
> > __block_write_begin (which calls get_block, which I think is where the
> > latency comes from) *before* wait_on_page_writeback, which means that
> > there might not be any space allocated yet.
> 
> I think I'm right (other than calling it a race).  If I change my code to do:
> 
> - map the file (with MCL_FUTURE set)
> - fallocate
> - dirty all pages
> - fsync
> - dirty all pages again
> 
> in the non-real-time thread, then a short test that was a mediocre
> reproducer seems to work.
> 
> This is annoying, though -- I'm not generating twice as much write I/O
> as I used to.  Is there any way to force the delalloc code to do its
> thing without triggering writeback?  I don't think fallocate has this
> effect.
  fallocate() will preallocate blocks on disk backing the mapped page. That
should get rid of latency in __block_write_begin(). Extents will still be
marked as uninitialized, but conversion from uninitialized to initialized
state happens during writeback / IO completion so you should not care much
about it.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-11-02 20:17                         ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-02 20:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue 01-11-11 18:51:04, Andy Lutomirski wrote:
> On Tue, Nov 1, 2011 at 4:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:
> >> Avoiding IO during a minor fault would be a decent thing which might be
> >> worth pursuing. As you properly noted "stable pages during writeback"
> >> requirement is one obstacle which won't be that trivial to avoid though...
> >
> > There's an easy solution that would be good enough for me: add a mount
> > option to turn off stable pages.
> >
> > Is the other problem just a race, perhaps?  __block_page_mkwrite calls
> > __block_write_begin (which calls get_block, which I think is where the
> > latency comes from) *before* wait_on_page_writeback, which means that
> > there might not be any space allocated yet.
> 
> I think I'm right (other than calling it a race).  If I change my code to do:
> 
> - map the file (with MCL_FUTURE set)
> - fallocate
> - dirty all pages
> - fsync
> - dirty all pages again
> 
> in the non-real-time thread, then a short test that was a mediocre
> reproducer seems to work.
> 
> This is annoying, though -- I'm not generating twice as much write I/O
> as I used to.  Is there any way to force the delalloc code to do its
> thing without triggering writeback?  I don't think fallocate has this
> effect.
  fallocate() will preallocate blocks on disk backing the mapped page. That
should get rid of latency in __block_write_begin(). Extents will still be
marked as uninitialized, but conversion from uninitialized to initialized
state happens during writeback / IO completion so you should not care much
about it.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Latency writing to an mlocked ext4 mapping
@ 2011-11-02 20:17                         ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-11-02 20:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Andreas Dilger, linux-kernel, linux-mm, linux-ext4

On Tue 01-11-11 18:51:04, Andy Lutomirski wrote:
> On Tue, Nov 1, 2011 at 4:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Tue, Nov 1, 2011 at 4:03 PM, Jan Kara <jack@suse.cz> wrote:
> >> Avoiding IO during a minor fault would be a decent thing which might be
> >> worth pursuing. As you properly noted "stable pages during writeback"
> >> requirement is one obstacle which won't be that trivial to avoid though...
> >
> > There's an easy solution that would be good enough for me: add a mount
> > option to turn off stable pages.
> >
> > Is the other problem just a race, perhaps?  __block_page_mkwrite calls
> > __block_write_begin (which calls get_block, which I think is where the
> > latency comes from) *before* wait_on_page_writeback, which means that
> > there might not be any space allocated yet.
> 
> I think I'm right (other than calling it a race).  If I change my code to do:
> 
> - map the file (with MCL_FUTURE set)
> - fallocate
> - dirty all pages
> - fsync
> - dirty all pages again
> 
> in the non-real-time thread, then a short test that was a mediocre
> reproducer seems to work.
> 
> This is annoying, though -- I'm not generating twice as much write I/O
> as I used to.  Is there any way to force the delalloc code to do its
> thing without triggering writeback?  I don't think fallocate has this
> effect.
  fallocate() will preallocate blocks on disk backing the mapped page. That
should get rid of latency in __block_write_begin(). Extents will still be
marked as uninitialized, but conversion from uninitialized to initialized
state happens during writeback / IO completion so you should not care much
about it.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2011-11-02 20:17 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-20  0:39 Latency writing to an mlocked ext4 mapping Andy Lutomirski
2011-10-20  0:39 ` Andy Lutomirski
2011-10-20  1:02 ` Andreas Dilger
2011-10-20  1:02   ` Andreas Dilger
2011-10-20  1:15   ` Andy Lutomirski
2011-10-20  1:15     ` Andy Lutomirski
2011-10-20  2:17     ` Andy Lutomirski
2011-10-20  2:17       ` Andy Lutomirski
2011-10-20  2:17       ` Andy Lutomirski
2011-10-20  5:59       ` Andy Lutomirski
2011-10-20  5:59         ` Andy Lutomirski
2011-10-20  5:59         ` Andy Lutomirski
2011-10-25 12:26         ` Jan Kara
2011-10-25 12:26           ` Jan Kara
2011-10-25 12:26           ` Jan Kara
2011-10-28 23:37           ` Andy Lutomirski
2011-10-28 23:37             ` Andy Lutomirski
2011-10-28 23:39             ` [PATCH] mm: Improve cmtime update on shared writable mmaps Andy Lutomirski
2011-10-28 23:39               ` Andy Lutomirski
2011-11-01 22:53               ` Jan Kara
2011-11-01 22:53                 ` Jan Kara
2011-11-01 23:02                 ` Andy Lutomirski
2011-11-01 23:02                   ` Andy Lutomirski
2011-11-01 23:02                   ` Andy Lutomirski
2011-11-02  7:38                   ` Christoph Hellwig
2011-11-02  7:38                     ` Christoph Hellwig
2011-11-02 15:02                   ` Jan Kara
2011-11-02 15:02                     ` Jan Kara
2011-11-02 15:02                     ` Jan Kara
2011-11-02 15:19                     ` Ted Ts'o
2011-11-02 15:19                       ` Ted Ts'o
2011-10-31 23:10             ` Latency writing to an mlocked ext4 mapping Jan Kara
2011-10-31 23:10               ` Jan Kara
2011-10-31 23:10               ` Jan Kara
2011-10-31 23:14               ` Andy Lutomirski
2011-10-31 23:14                 ` Andy Lutomirski
2011-10-31 23:14                 ` Andy Lutomirski
2011-11-01 23:03                 ` Jan Kara
2011-11-01 23:03                   ` Jan Kara
2011-11-01 23:03                   ` Jan Kara
2011-11-01 23:10                   ` Andy Lutomirski
2011-11-01 23:10                     ` Andy Lutomirski
2011-11-01 23:10                     ` Andy Lutomirski
2011-11-02  1:51                     ` Andy Lutomirski
2011-11-02  1:51                       ` Andy Lutomirski
2011-11-02  1:51                       ` Andy Lutomirski
2011-11-02 20:17                       ` Jan Kara
2011-11-02 20:17                         ` Jan Kara
2011-11-02 20:17                         ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.