All of lore.kernel.org
 help / color / mirror / Atom feed
* Filesystem crashes due to pages without buffers
@ 2018-01-03 10:04 ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-01-03 10:04 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-fsdevel, linux-xfs, linux-ext4, Dan Williams

Hello,

Over the years I have seen so far unexplained crashed in filesystem's
(ext4, xfs) writeback path due to dirty pages without buffers attached to
them (see [1] and [2] for relatively recent reports). This was confusing as
reclaim takes care not to strip buffers from a dirty page and both
filesystems do add buffers to a page when it is first written to - in
->page_mkwrite() and ->write_begin callbacks.

Recently I have come across a code path that is probably leading to this
inconsistent state and I'd like to discuss how to best fix the problem
because it's not obvious to me. Consider the following race:

CPU1					CPU2

addr = mmap(file1, MAP_SHARED, ...);
fd2 = open(file2, O_DIRECT | O_RDONLY);
read(fd2, addr, len)
  do_direct_IO()
    page = dio_get_page()
      dio_refill_pages()
        iov_iter_get_pages()
	  get_user_pages_fast()
            - page fault
              ->page_mkwrite()
                block_page_mkwrite()
                  lock_page(page);
                  - attaches buffers to page
                  - makes sure blocks are allocated
                  set_page_dirty(page)
              - install writeable PTE
              unlock_page(page);
    submit_page_section(page)
      - submits bio with 'page' as a buffer
					kswapd reclaims pages:
					...
					shrink_page_list()
					  trylock_page(page) - this is the
					    page CPU1 has just faulted in
					  try_to_unmap(page)
					  pageout(page);
					    clear_page_dirty_for_io(page);
					    ->writepage()
					  - let's assume page got written
					    out fast enough, alternatively
					    we could get to the same path as
					    soon as the page IO completes
					  if (page_has_private(page)) {
					    try_to_release_page(page)
					      - reclaims buffers from the
					        page
					   __remove_mapping(page)
					     - fails as DIO code still
					       holds page reference
...

eventually read completes
  dio_bio_complete(bio)
    set_page_dirty_lock(page)
      Bummer, we've just marked the page as dirty without having buffers.
      Eventually writeback will find it and filesystem will complain...

Am I missing something?
 
The problem here is that filesystems fundamentally assume that a page can
be written to only between ->write_begin - ->write_end (in this interval
the page is locked), or between ->page_mkwrite - ->writepage and above is
an example where this does not hold because when a page reference is
acquired through get_user_pages(), page can get written to by the holder of
the reference and dirtied even after it has been unmapped from page tables
and ->writepage has been called. This is not only a cosmetic issue leading
to assertion failure but it can also lead to data loss, data corruption, or
other unpleasant surprises as filesystems assume page contents cannot be
modified until either ->write_begin() or ->page_mkwrite gets called and
those calls are serialized by proper locking with problematic operations
such as hole punching etc.

I'm not sure how to fix this problem. We could 'simulate' a writeable page
fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
virtual address of the fault, don't hold mmap_sem, etc., possibly
expensive, but it would make filesystems happy. Data stored by GUP user
(e.g. read by DIO in the above case) could still get lost if someone e.g.
punched hole under the buffer or otherwise messed with the underlying
storage of the page while DIO was running but arguably users could expect
such outcome.

Another possible solution would be to make sure page is writeably mapped
until GUP user drops its reference. That would be arguably cleaner but
probably that would mean we have to track number of writeable GUP page
references separately (no space space in struct page is a problem here) and
block page_mkclean() until they are dropped. Also for long term GUP users
like Infiniband or V4L we'd have to come up with some solution as we should
not block page_mkclean() for so long.

As a side note DAX needs some solution for GUP users as well. The problems
are similar there in nature, just much easier to hit. So at least a
solution for long-term GUP users can (and I strongly believe should) be
shared between standard and DAX paths.

Anybody has other ideas how to fix the problem or opinions on which
solution would be better to use or some complications I have missed?

								Honza

[1] https://www.spinics.net/lists/linux-xfs/msg10090.html
[2] https://www.spinics.net/lists/linux-ext4/msg54377.html

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Filesystem crashes due to pages without buffers
@ 2018-01-03 10:04 ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-01-03 10:04 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-fsdevel, linux-xfs, linux-ext4, Dan Williams

Hello,

Over the years I have seen so far unexplained crashed in filesystem's
(ext4, xfs) writeback path due to dirty pages without buffers attached to
them (see [1] and [2] for relatively recent reports). This was confusing as
reclaim takes care not to strip buffers from a dirty page and both
filesystems do add buffers to a page when it is first written to - in
->page_mkwrite() and ->write_begin callbacks.

Recently I have come across a code path that is probably leading to this
inconsistent state and I'd like to discuss how to best fix the problem
because it's not obvious to me. Consider the following race:

CPU1					CPU2

addr = mmap(file1, MAP_SHARED, ...);
fd2 = open(file2, O_DIRECT | O_RDONLY);
read(fd2, addr, len)
  do_direct_IO()
    page = dio_get_page()
      dio_refill_pages()
        iov_iter_get_pages()
	  get_user_pages_fast()
            - page fault
              ->page_mkwrite()
                block_page_mkwrite()
                  lock_page(page);
                  - attaches buffers to page
                  - makes sure blocks are allocated
                  set_page_dirty(page)
              - install writeable PTE
              unlock_page(page);
    submit_page_section(page)
      - submits bio with 'page' as a buffer
					kswapd reclaims pages:
					...
					shrink_page_list()
					  trylock_page(page) - this is the
					    page CPU1 has just faulted in
					  try_to_unmap(page)
					  pageout(page);
					    clear_page_dirty_for_io(page);
					    ->writepage()
					  - let's assume page got written
					    out fast enough, alternatively
					    we could get to the same path as
					    soon as the page IO completes
					  if (page_has_private(page)) {
					    try_to_release_page(page)
					      - reclaims buffers from the
					        page
					   __remove_mapping(page)
					     - fails as DIO code still
					       holds page reference
...

eventually read completes
  dio_bio_complete(bio)
    set_page_dirty_lock(page)
      Bummer, we've just marked the page as dirty without having buffers.
      Eventually writeback will find it and filesystem will complain...

Am I missing something?
 
The problem here is that filesystems fundamentally assume that a page can
be written to only between ->write_begin - ->write_end (in this interval
the page is locked), or between ->page_mkwrite - ->writepage and above is
an example where this does not hold because when a page reference is
acquired through get_user_pages(), page can get written to by the holder of
the reference and dirtied even after it has been unmapped from page tables
and ->writepage has been called. This is not only a cosmetic issue leading
to assertion failure but it can also lead to data loss, data corruption, or
other unpleasant surprises as filesystems assume page contents cannot be
modified until either ->write_begin() or ->page_mkwrite gets called and
those calls are serialized by proper locking with problematic operations
such as hole punching etc.

I'm not sure how to fix this problem. We could 'simulate' a writeable page
fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
virtual address of the fault, don't hold mmap_sem, etc., possibly
expensive, but it would make filesystems happy. Data stored by GUP user
(e.g. read by DIO in the above case) could still get lost if someone e.g.
punched hole under the buffer or otherwise messed with the underlying
storage of the page while DIO was running but arguably users could expect
such outcome.

Another possible solution would be to make sure page is writeably mapped
until GUP user drops its reference. That would be arguably cleaner but
probably that would mean we have to track number of writeable GUP page
references separately (no space space in struct page is a problem here) and
block page_mkclean() until they are dropped. Also for long term GUP users
like Infiniband or V4L we'd have to come up with some solution as we should
not block page_mkclean() for so long.

As a side note DAX needs some solution for GUP users as well. The problems
are similar there in nature, just much easier to hit. So at least a
solution for long-term GUP users can (and I strongly believe should) be
shared between standard and DAX paths.

Anybody has other ideas how to fix the problem or opinions on which
solution would be better to use or some complications I have missed?

								Honza

[1] https://www.spinics.net/lists/linux-xfs/msg10090.html
[2] https://www.spinics.net/lists/linux-ext4/msg54377.html

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-01-03 10:04 ` Jan Kara
@ 2018-01-04  4:56   ` Dan Williams
  -1 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2018-01-04  4:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linux MM, linux-fsdevel, linux-xfs, linux-ext4

On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@suse.cz> wrote:
> Hello,
>
> Over the years I have seen so far unexplained crashed in filesystem's
> (ext4, xfs) writeback path due to dirty pages without buffers attached to
> them (see [1] and [2] for relatively recent reports). This was confusing as
> reclaim takes care not to strip buffers from a dirty page and both
> filesystems do add buffers to a page when it is first written to - in
> ->page_mkwrite() and ->write_begin callbacks.
>
> Recently I have come across a code path that is probably leading to this
> inconsistent state and I'd like to discuss how to best fix the problem
> because it's not obvious to me. Consider the following race:
>
> CPU1                                    CPU2
>
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     page = dio_get_page()
>       dio_refill_pages()
>         iov_iter_get_pages()
>           get_user_pages_fast()
>             - page fault
>               ->page_mkwrite()
>                 block_page_mkwrite()
>                   lock_page(page);
>                   - attaches buffers to page
>                   - makes sure blocks are allocated
>                   set_page_dirty(page)
>               - install writeable PTE
>               unlock_page(page);
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
>                                         kswapd reclaims pages:
>                                         ...
>                                         shrink_page_list()
>                                           trylock_page(page) - this is the
>                                             page CPU1 has just faulted in
>                                           try_to_unmap(page)
>                                           pageout(page);
>                                             clear_page_dirty_for_io(page);
>                                             ->writepage()
>                                           - let's assume page got written
>                                             out fast enough, alternatively
>                                             we could get to the same path as
>                                             soon as the page IO completes
>                                           if (page_has_private(page)) {
>                                             try_to_release_page(page)
>                                               - reclaims buffers from the
>                                                 page
>                                            __remove_mapping(page)
>                                              - fails as DIO code still
>                                                holds page reference
> ...
>
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
>       Bummer, we've just marked the page as dirty without having buffers.
>       Eventually writeback will find it and filesystem will complain...
>
> Am I missing something?
>
> The problem here is that filesystems fundamentally assume that a page can
> be written to only between ->write_begin - ->write_end (in this interval
> the page is locked), or between ->page_mkwrite - ->writepage and above is
> an example where this does not hold because when a page reference is
> acquired through get_user_pages(), page can get written to by the holder of
> the reference and dirtied even after it has been unmapped from page tables
> and ->writepage has been called. This is not only a cosmetic issue leading
> to assertion failure but it can also lead to data loss, data corruption, or
> other unpleasant surprises as filesystems assume page contents cannot be
> modified until either ->write_begin() or ->page_mkwrite gets called and
> those calls are serialized by proper locking with problematic operations
> such as hole punching etc.
>
> I'm not sure how to fix this problem. We could 'simulate' a writeable page
> fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> virtual address of the fault, don't hold mmap_sem, etc., possibly
> expensive, but it would make filesystems happy. Data stored by GUP user
> (e.g. read by DIO in the above case) could still get lost if someone e.g.
> punched hole under the buffer or otherwise messed with the underlying
> storage of the page while DIO was running but arguably users could expect
> such outcome.
>
> Another possible solution would be to make sure page is writeably mapped
> until GUP user drops its reference. That would be arguably cleaner but
> probably that would mean we have to track number of writeable GUP page
> references separately (no space space in struct page is a problem here) and
> block page_mkclean() until they are dropped. Also for long term GUP users
> like Infiniband or V4L we'd have to come up with some solution as we should
> not block page_mkclean() for so long.

Do we need to block page_mkclean, or could we defer buffer reclaiming
to the last put of the page?

I think once we have the "register memory with lease" mechanism for
Infiniband we could expand it to the page cache case. The problem is
the regression this would cause with userspace that expects it can
maintain file backed memory registrations indefinitely.

What are the implications of holding off page_mkclean or release
buffers indefinitely?

Is an indefinite / interruptible sleep waiting for the 'put' event of
a get_user_pages() page unacceptable? The current case that the file
contents will not be coherent with respect to in-flight RDMA, perhaps
waiting for that to complete is better than cleaning buffers from the
page prematurely.

Yes, I have more questions than proposals.

>
> As a side note DAX needs some solution for GUP users as well. The problems
> are similar there in nature, just much easier to hit. So at least a
> solution for long-term GUP users can (and I strongly believe should) be
> shared between standard and DAX paths.

In the DAX case we rely on the fact that when the page goes idle we
only need to worry about the filesytem block map changing, the page
won't get reallocated somewhere else. We can't use page idle as an
event in this case, however, if the page reference count is one then
the DIO code can know that it has the page exclusively, so maybe DAX
and non-DAX can share the page count == 1 event notification.

However there's still the matter of how to callback into the
filesystem. The DAX case is currently using a pgmap_radix lookup at
put_page() time to determine when to wakeup waiters. I think this
should move over to a new address_space_operation. That would help
with reusing some of the DAX case machinery for this case, but the DIO
code would need a special case put_page that checks for count == 1 and
synchronization to hold off the DIO submission path to prevent new
page elevations.

...are you sure this is still similar enough to the DAX case that they
can reuse much of the same machinery?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
@ 2018-01-04  4:56   ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2018-01-04  4:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Linux MM, linux-fsdevel, linux-xfs, linux-ext4

On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@suse.cz> wrote:
> Hello,
>
> Over the years I have seen so far unexplained crashed in filesystem's
> (ext4, xfs) writeback path due to dirty pages without buffers attached to
> them (see [1] and [2] for relatively recent reports). This was confusing as
> reclaim takes care not to strip buffers from a dirty page and both
> filesystems do add buffers to a page when it is first written to - in
> ->page_mkwrite() and ->write_begin callbacks.
>
> Recently I have come across a code path that is probably leading to this
> inconsistent state and I'd like to discuss how to best fix the problem
> because it's not obvious to me. Consider the following race:
>
> CPU1                                    CPU2
>
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     page = dio_get_page()
>       dio_refill_pages()
>         iov_iter_get_pages()
>           get_user_pages_fast()
>             - page fault
>               ->page_mkwrite()
>                 block_page_mkwrite()
>                   lock_page(page);
>                   - attaches buffers to page
>                   - makes sure blocks are allocated
>                   set_page_dirty(page)
>               - install writeable PTE
>               unlock_page(page);
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
>                                         kswapd reclaims pages:
>                                         ...
>                                         shrink_page_list()
>                                           trylock_page(page) - this is the
>                                             page CPU1 has just faulted in
>                                           try_to_unmap(page)
>                                           pageout(page);
>                                             clear_page_dirty_for_io(page);
>                                             ->writepage()
>                                           - let's assume page got written
>                                             out fast enough, alternatively
>                                             we could get to the same path as
>                                             soon as the page IO completes
>                                           if (page_has_private(page)) {
>                                             try_to_release_page(page)
>                                               - reclaims buffers from the
>                                                 page
>                                            __remove_mapping(page)
>                                              - fails as DIO code still
>                                                holds page reference
> ...
>
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
>       Bummer, we've just marked the page as dirty without having buffers.
>       Eventually writeback will find it and filesystem will complain...
>
> Am I missing something?
>
> The problem here is that filesystems fundamentally assume that a page can
> be written to only between ->write_begin - ->write_end (in this interval
> the page is locked), or between ->page_mkwrite - ->writepage and above is
> an example where this does not hold because when a page reference is
> acquired through get_user_pages(), page can get written to by the holder of
> the reference and dirtied even after it has been unmapped from page tables
> and ->writepage has been called. This is not only a cosmetic issue leading
> to assertion failure but it can also lead to data loss, data corruption, or
> other unpleasant surprises as filesystems assume page contents cannot be
> modified until either ->write_begin() or ->page_mkwrite gets called and
> those calls are serialized by proper locking with problematic operations
> such as hole punching etc.
>
> I'm not sure how to fix this problem. We could 'simulate' a writeable page
> fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> virtual address of the fault, don't hold mmap_sem, etc., possibly
> expensive, but it would make filesystems happy. Data stored by GUP user
> (e.g. read by DIO in the above case) could still get lost if someone e.g.
> punched hole under the buffer or otherwise messed with the underlying
> storage of the page while DIO was running but arguably users could expect
> such outcome.
>
> Another possible solution would be to make sure page is writeably mapped
> until GUP user drops its reference. That would be arguably cleaner but
> probably that would mean we have to track number of writeable GUP page
> references separately (no space space in struct page is a problem here) and
> block page_mkclean() until they are dropped. Also for long term GUP users
> like Infiniband or V4L we'd have to come up with some solution as we should
> not block page_mkclean() for so long.

Do we need to block page_mkclean, or could we defer buffer reclaiming
to the last put of the page?

I think once we have the "register memory with lease" mechanism for
Infiniband we could expand it to the page cache case. The problem is
the regression this would cause with userspace that expects it can
maintain file backed memory registrations indefinitely.

What are the implications of holding off page_mkclean or release
buffers indefinitely?

Is an indefinite / interruptible sleep waiting for the 'put' event of
a get_user_pages() page unacceptable? The current case that the file
contents will not be coherent with respect to in-flight RDMA, perhaps
waiting for that to complete is better than cleaning buffers from the
page prematurely.

Yes, I have more questions than proposals.

>
> As a side note DAX needs some solution for GUP users as well. The problems
> are similar there in nature, just much easier to hit. So at least a
> solution for long-term GUP users can (and I strongly believe should) be
> shared between standard and DAX paths.

In the DAX case we rely on the fact that when the page goes idle we
only need to worry about the filesytem block map changing, the page
won't get reallocated somewhere else. We can't use page idle as an
event in this case, however, if the page reference count is one then
the DIO code can know that it has the page exclusively, so maybe DAX
and non-DAX can share the page count == 1 event notification.

However there's still the matter of how to callback into the
filesystem. The DAX case is currently using a pgmap_radix lookup at
put_page() time to determine when to wakeup waiters. I think this
should move over to a new address_space_operation. That would help
with reusing some of the DAX case machinery for this case, but the DIO
code would need a special case put_page that checks for count == 1 and
synchronization to hold off the DIO submission path to prevent new
page elevations.

...are you sure this is still similar enough to the DAX case that they
can reuse much of the same machinery?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-01-03 10:04 ` Jan Kara
@ 2018-01-04  5:59   ` Dave Chinner
  -1 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2018-01-04  5:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, linux-fsdevel, linux-xfs, linux-ext4, Dan Williams

On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> Hello,
> 
> Over the years I have seen so far unexplained crashed in filesystem's
> (ext4, xfs) writeback path due to dirty pages without buffers attached to
> them (see [1] and [2] for relatively recent reports). This was confusing as
> reclaim takes care not to strip buffers from a dirty page and both
> filesystems do add buffers to a page when it is first written to - in
> ->page_mkwrite() and ->write_begin callbacks.
> 
> Recently I have come across a code path that is probably leading to this
> inconsistent state and I'd like to discuss how to best fix the problem
> because it's not obvious to me. Consider the following race:
> 
> CPU1					CPU2
> 
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     page = dio_get_page()
>       dio_refill_pages()
>         iov_iter_get_pages()
> 	  get_user_pages_fast()
>             - page fault
>               ->page_mkwrite()
>                 block_page_mkwrite()
>                   lock_page(page);
>                   - attaches buffers to page
>                   - makes sure blocks are allocated
>                   set_page_dirty(page)
>               - install writeable PTE
>               unlock_page(page);
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
> 					kswapd reclaims pages:
> 					...
> 					shrink_page_list()
> 					  trylock_page(page) - this is the
> 					    page CPU1 has just faulted in
> 					  try_to_unmap(page)
> 					  pageout(page);
> 					    clear_page_dirty_for_io(page);
> 					    ->writepage()
> 					  - let's assume page got written
> 					    out fast enough, alternatively
> 					    we could get to the same path as
> 					    soon as the page IO completes
> 					  if (page_has_private(page)) {
> 					    try_to_release_page(page)
> 					      - reclaims buffers from the
> 					        page
> 					   __remove_mapping(page)
> 					     - fails as DIO code still
> 					       holds page reference
> ...
> 
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
>       Bummer, we've just marked the page as dirty without having buffers.
>       Eventually writeback will find it and filesystem will complain...
> 
> Am I missing something?

My first question is why is kswapd trying to reclaim a page with an
elevated active reference count? i.e. there are active references
the VM *doesn't own* to the page, which means that there may well
a user that expects the state on the page (e.g. the page private
data that the active reference instantiated!) to remain intact until
it drops it's active reference.

That seems like really basic reference counting/reclaim bug to me:
we shouldn't ever attempt to reclaim and free an object while there
are active external references to it that object, regardless of the
subsystem the object belongs to....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
@ 2018-01-04  5:59   ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2018-01-04  5:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, linux-fsdevel, linux-xfs, linux-ext4, Dan Williams

On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> Hello,
> 
> Over the years I have seen so far unexplained crashed in filesystem's
> (ext4, xfs) writeback path due to dirty pages without buffers attached to
> them (see [1] and [2] for relatively recent reports). This was confusing as
> reclaim takes care not to strip buffers from a dirty page and both
> filesystems do add buffers to a page when it is first written to - in
> ->page_mkwrite() and ->write_begin callbacks.
> 
> Recently I have come across a code path that is probably leading to this
> inconsistent state and I'd like to discuss how to best fix the problem
> because it's not obvious to me. Consider the following race:
> 
> CPU1					CPU2
> 
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     page = dio_get_page()
>       dio_refill_pages()
>         iov_iter_get_pages()
> 	  get_user_pages_fast()
>             - page fault
>               ->page_mkwrite()
>                 block_page_mkwrite()
>                   lock_page(page);
>                   - attaches buffers to page
>                   - makes sure blocks are allocated
>                   set_page_dirty(page)
>               - install writeable PTE
>               unlock_page(page);
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
> 					kswapd reclaims pages:
> 					...
> 					shrink_page_list()
> 					  trylock_page(page) - this is the
> 					    page CPU1 has just faulted in
> 					  try_to_unmap(page)
> 					  pageout(page);
> 					    clear_page_dirty_for_io(page);
> 					    ->writepage()
> 					  - let's assume page got written
> 					    out fast enough, alternatively
> 					    we could get to the same path as
> 					    soon as the page IO completes
> 					  if (page_has_private(page)) {
> 					    try_to_release_page(page)
> 					      - reclaims buffers from the
> 					        page
> 					   __remove_mapping(page)
> 					     - fails as DIO code still
> 					       holds page reference
> ...
> 
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
>       Bummer, we've just marked the page as dirty without having buffers.
>       Eventually writeback will find it and filesystem will complain...
> 
> Am I missing something?

My first question is why is kswapd trying to reclaim a page with an
elevated active reference count? i.e. there are active references
the VM *doesn't own* to the page, which means that there may well
a user that expects the state on the page (e.g. the page private
data that the active reference instantiated!) to remain intact until
it drops it's active reference.

That seems like really basic reference counting/reclaim bug to me:
we shouldn't ever attempt to reclaim and free an object while there
are active external references to it that object, regardless of the
subsystem the object belongs to....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-01-03 10:04 ` Jan Kara
                   ` (2 preceding siblings ...)
  (?)
@ 2018-01-04  6:10 ` Leon Romanovsky
  -1 siblings, 0 replies; 16+ messages in thread
From: Leon Romanovsky @ 2018-01-04  6:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-ext4, Dan Williams,
	RDMA mailing list, Majd Dibbiny

[-- Attachment #1: Type: text/plain, Size: 9537 bytes --]

On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> Hello,
>
> Over the years I have seen so far unexplained crashed in filesystem's
> (ext4, xfs) writeback path due to dirty pages without buffers attached to
> them (see [1] and [2] for relatively recent reports). This was confusing as
> reclaim takes care not to strip buffers from a dirty page and both
> filesystems do add buffers to a page when it is first written to - in
> ->page_mkwrite() and ->write_begin callbacks.
>
> Recently I have come across a code path that is probably leading to this
> inconsistent state and I'd like to discuss how to best fix the problem
> because it's not obvious to me. Consider the following race:
>
> CPU1					CPU2
>
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     page = dio_get_page()
>       dio_refill_pages()
>         iov_iter_get_pages()
> 	  get_user_pages_fast()
>             - page fault
>               ->page_mkwrite()
>                 block_page_mkwrite()
>                   lock_page(page);
>                   - attaches buffers to page
>                   - makes sure blocks are allocated
>                   set_page_dirty(page)
>               - install writeable PTE
>               unlock_page(page);
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
> 					kswapd reclaims pages:
> 					...
> 					shrink_page_list()
> 					  trylock_page(page) - this is the
> 					    page CPU1 has just faulted in
> 					  try_to_unmap(page)
> 					  pageout(page);
> 					    clear_page_dirty_for_io(page);
> 					    ->writepage()
> 					  - let's assume page got written
> 					    out fast enough, alternatively
> 					    we could get to the same path as
> 					    soon as the page IO completes
> 					  if (page_has_private(page)) {
> 					    try_to_release_page(page)
> 					      - reclaims buffers from the
> 					        page
> 					   __remove_mapping(page)
> 					     - fails as DIO code still
> 					       holds page reference
> ...
>
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
>       Bummer, we've just marked the page as dirty without having buffers.
>       Eventually writeback will find it and filesystem will complain...
>
> Am I missing something?
>
> The problem here is that filesystems fundamentally assume that a page can
> be written to only between ->write_begin - ->write_end (in this interval
> the page is locked), or between ->page_mkwrite - ->writepage and above is
> an example where this does not hold because when a page reference is
> acquired through get_user_pages(), page can get written to by the holder of
> the reference and dirtied even after it has been unmapped from page tables
> and ->writepage has been called. This is not only a cosmetic issue leading
> to assertion failure but it can also lead to data loss, data corruption, or
> other unpleasant surprises as filesystems assume page contents cannot be
> modified until either ->write_begin() or ->page_mkwrite gets called and
> those calls are serialized by proper locking with problematic operations
> such as hole punching etc.
>
> I'm not sure how to fix this problem. We could 'simulate' a writeable page
> fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> virtual address of the fault, don't hold mmap_sem, etc., possibly
> expensive, but it would make filesystems happy. Data stored by GUP user
> (e.g. read by DIO in the above case) could still get lost if someone e.g.
> punched hole under the buffer or otherwise messed with the underlying
> storage of the page while DIO was running but arguably users could expect
> such outcome.
>
> Another possible solution would be to make sure page is writeably mapped
> until GUP user drops its reference. That would be arguably cleaner but
> probably that would mean we have to track number of writeable GUP page
> references separately (no space space in struct page is a problem here) and
> block page_mkclean() until they are dropped. Also for long term GUP users
> like Infiniband or V4L we'd have to come up with some solution as we should
> not block page_mkclean() for so long.
>
> As a side note DAX needs some solution for GUP users as well. The problems
> are similar there in nature, just much easier to hit. So at least a
> solution for long-term GUP users can (and I strongly believe should) be
> shared between standard and DAX paths.
>
> Anybody has other ideas how to fix the problem or opinions on which
> solution would be better to use or some complications I have missed?
>

+RDMA

Hi Jan,

I don't have actual proposals how to fix, but wanted to mention that
we have a customer who experiences those failures in his setup.
In his case, it is reproducible in 100% of cases in approximately 2
minutes of run.

His application creates two memory regions with ib_umem_get(), one is
backed by ext4 and another is anonymous. Approximately after two minutes
of data traffic, he stops the system and calls to release those memory
regions with ib_umem_release()->__ib_umem_release()->set_page_dirty_lock().

A couple of seconds later, he hits the following BUG_ON.

[ 1411.545311] ------------[ cut here ]------------
[ 1411.545340] kernel BUG at fs/ext4/inode.c:2297!
[ 1411.545360] invalid opcode: 0000 [#1] SMP
[ 1411.545381] Modules linked in: xt_nat veth ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack
br_netfilter bridge stp llc overlay(T) rdma_ucm(OE) ib_ucm(OE)
rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE)
mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE)
mlx4_core(OE) devlink mlx_compat(OE) intel_powerclamp coretemp
intel_rapl iosf_mbi kvm_intel vfat fat kvm irqbypass crc32_pclmul
ghash_clmulni_intel aesni_intel lrw gf128mul ext4 glue_helper
ablk_helper cryptd mbcache jbd2 iTCO_wdt iTCO_vendor_support mxm_wmi
pcspkr sb_edac edac_core i2c_i801 sg mei_me mei shpchp lpc_ich
ipmi_devintf ipmi_si ipmi_msghandler acpi_pad wmi acpi_power_meter
knem(OE) nfsd auth_rpcgss
[ 1411.545744] nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
crc_t10dif crct10dif_generic ast drm_kms_helper crct10dif_pclmul
crct10dif_common syscopyarea sysfillrect crc32c_intel mpt3sas sysimgblt
fb_sys_fops ttm ahci raid_class libahci scsi_transport_sas igb drm
libata dca i2c_algo_bit ptp nvme i2c_core pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod [last unloaded: devlink]
[ 1411.545926] CPU: 6 PID: 13195 Comm: node_runner_w8 Tainted: G W OE
------------ T 3.10.0-514.21.1.el7.debug_bz1368895.x86_64 #1
[ 1411.545975] Hardware name: Quanta Computer Inc D51BP-1U (dual 1G LoM)/S2BP-MB (dual 1G LoM), BIOS S2BP3B04 03/03/2016
[ 1411.546017] task: ffff881e4b323ec0 ti: ffff881e49bbc000 task.ti: ffff881e49bbc000
[ 1411.546047] RIP: 0010:[<ffffffffa07083e5>] [<ffffffffa07083e5>] mpage_prepare_extent_to_map+0x2d5/0x2e0 [ext4]
[ 1411.546103] RSP: 0018:ffff881e49bbfc10 EFLAGS: 00010246
[ 1411.546125] RAX: 001fffff0000003d RBX: ffff881e49bbfc68 RCX: 0000000000000170
[ 1411.546154] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88207ff8dde8
[ 1411.546183] RBP: ffff881e49bbfce8 R08: 0000000000000000 R09: 0000000000000001
[ 1411.546212] R10: 57fe04c2df4f6680 R11: 0000000000000008 R12: 7ffffffffffffe9e
[ 1411.546240] R13: 000000000003ffff R14: ffffea0001449a00 R15: ffff881e49bbfd90
[ 1411.546270] FS: 00007f5dd5de7d40(0000) GS:ffff881fffb80000(0000) knlGS:0000000000000000
[ 1411.546302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1411.546325] CR2: 00007f5b0c5969c0 CR3: 0000001e5e6ae000 CR4: 00000000003407e0
[ 1411.546354] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1411.546383] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1411.546412] Stack:
[ 1411.546421] ffff881e49bbfc50 0000000000000002 ffff881f07b00628 ffff881e49bbfcb8
[ 1411.546456] 000000000000017b 000000000000000e 0000000000000000 ffffea00794f6600
[ 1411.546490] ffffea00794f6640 ffffea00794f6680 ffffea0001449a00 ffffea00014499c0
[ 1411.546524] Call Trace:
[ 1411.546545] [<ffffffffa070c9ab>] ext4_writepages+0x45b/0xd60 [ext4]
[ 1411.546576] [<ffffffff8118d93e>] do_writepages+0x1e/0x40
[ 1411.546601] [<ffffffff811824f5>] __filemap_fdatawrite_range+0x65/0x80
[ 1411.546629] [<ffffffff81182641>] filemap_write_and_wait_range+0x41/0x90
[ 1411.546664] [<ffffffffa0703bba>] ext4_sync_file+0xba/0x320 [ext4]
[ 1411.546692] [<ffffffff8123028d>] vfs_fsync_range+0x1d/0x30
[ 1411.546717] [<ffffffff811ba89e>] SyS_msync+0x1fe/0x250
[ 1411.546741] [<ffffffff816974c9>] system_call_fastpath+0x16/0x1b
[ 1411.546765] Code: ff ff ff e8 2e 7e a8 e0 8b 85 40 ff ff ff eb c2 48
8d bd 50 ff ff ff e8 1a 7e a8 e0 eb 8c 4c 89 f7 e8 a0 81 a7 e0 e9 d5 fe
ff ff <0f> 0b 0f 0b e8 62 d6 97 e0 66 90 0f 1f 44 00 00 55 48 89 e5 41
[ 1411.548075] RIP [<ffffffffa07083e5>] mpage_prepare_extent_to_map+0x2d5/0x2e0 [ext4]
[ 1411.549292] RSP <ffff881e49bbfc10>
---here vmcore-dmesg got cut off----

Thanks

> 								Honza
>
> [1] https://www.spinics.net/lists/linux-xfs/msg10090.html
> [2] https://www.spinics.net/lists/linux-ext4/msg54377.html
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-01-04  5:59   ` Dave Chinner
@ 2018-01-04  8:52     ` Jan Kara
  -1 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-01-04  8:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-mm, linux-fsdevel, linux-xfs, linux-ext4, Dan Williams

On Thu 04-01-18 16:59:19, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> > Hello,
> > 
> > Over the years I have seen so far unexplained crashed in filesystem's
> > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > them (see [1] and [2] for relatively recent reports). This was confusing as
> > reclaim takes care not to strip buffers from a dirty page and both
> > filesystems do add buffers to a page when it is first written to - in
> > ->page_mkwrite() and ->write_begin callbacks.
> > 
> > Recently I have come across a code path that is probably leading to this
> > inconsistent state and I'd like to discuss how to best fix the problem
> > because it's not obvious to me. Consider the following race:
> > 
> > CPU1					CPU2
> > 
> > addr = mmap(file1, MAP_SHARED, ...);
> > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > read(fd2, addr, len)
> >   do_direct_IO()
> >     page = dio_get_page()
> >       dio_refill_pages()
> >         iov_iter_get_pages()
> > 	  get_user_pages_fast()
> >             - page fault
> >               ->page_mkwrite()
> >                 block_page_mkwrite()
> >                   lock_page(page);
> >                   - attaches buffers to page
> >                   - makes sure blocks are allocated
> >                   set_page_dirty(page)
> >               - install writeable PTE
> >               unlock_page(page);
> >     submit_page_section(page)
> >       - submits bio with 'page' as a buffer
> > 					kswapd reclaims pages:
> > 					...
> > 					shrink_page_list()
> > 					  trylock_page(page) - this is the
> > 					    page CPU1 has just faulted in
> > 					  try_to_unmap(page)
> > 					  pageout(page);
> > 					    clear_page_dirty_for_io(page);
> > 					    ->writepage()
> > 					  - let's assume page got written
> > 					    out fast enough, alternatively
> > 					    we could get to the same path as
> > 					    soon as the page IO completes
> > 					  if (page_has_private(page)) {
> > 					    try_to_release_page(page)
> > 					      - reclaims buffers from the
> > 					        page
> > 					   __remove_mapping(page)
> > 					     - fails as DIO code still
> > 					       holds page reference
> > ...
> > 
> > eventually read completes
> >   dio_bio_complete(bio)
> >     set_page_dirty_lock(page)
> >       Bummer, we've just marked the page as dirty without having buffers.
> >       Eventually writeback will find it and filesystem will complain...
> > 
> > Am I missing something?
> 
> My first question is why is kswapd trying to reclaim a page with an
> elevated active reference count? i.e. there are active references
> the VM *doesn't own* to the page, which means that there may well
> a user that expects the state on the page (e.g. the page private
> data that the active reference instantiated!) to remain intact until
> it drops it's active reference.

Page private data (and most of page state) is protected by a page lock, not
by a page reference. So reclaim (which is holding the page lock) is free to
try to reclaim page private data by calling ->releasepage callback.

That being said you are right that the attempt to reclaim a page with
active references is futile. But the problem is that we don't know how many
page references are actually left before we unmap the page from page tables
(each page table entry holds a page reference) and free page private data
(as that may hold page reference as well - e.g. attach_page_buffers()
acquires page reference). So checking page references in advance is
difficult.

Furthermore the core of the problem is not in the fact that page buffers
are reclaimed. That just makes it visible. The real problem is that page can
be written to by a GUP user while it is neither writeably mapped in page
tables nor prepared with ->write_begin. So a similar race violating
filesystem's assumptions can be like:

CPU1					CPU2

addr = mmap(file1, MAP_SHARED, ...);
fd2 = open(file2, O_DIRECT | O_RDONLY);
read(fd2, addr, len)
  do_direct_IO()
    ...
    page = get_user_pages_fast()
      - page fault handled
    submit_page_section(page)
      - submits bio with 'page' as a buffer
					ordinary writeback:
					writepages(file1)
					  clear_page_dirty_for_io(page)
					    - page gets writeprotected in
					      page tables
...
eventually read completes
  dio_bio_complete(bio)
    set_page_dirty_lock(page)

And a race like this is enough to cause data corruption if we are unlucky.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
@ 2018-01-04  8:52     ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-01-04  8:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-mm, linux-fsdevel, linux-xfs, linux-ext4, Dan Williams

On Thu 04-01-18 16:59:19, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> > Hello,
> > 
> > Over the years I have seen so far unexplained crashed in filesystem's
> > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > them (see [1] and [2] for relatively recent reports). This was confusing as
> > reclaim takes care not to strip buffers from a dirty page and both
> > filesystems do add buffers to a page when it is first written to - in
> > ->page_mkwrite() and ->write_begin callbacks.
> > 
> > Recently I have come across a code path that is probably leading to this
> > inconsistent state and I'd like to discuss how to best fix the problem
> > because it's not obvious to me. Consider the following race:
> > 
> > CPU1					CPU2
> > 
> > addr = mmap(file1, MAP_SHARED, ...);
> > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > read(fd2, addr, len)
> >   do_direct_IO()
> >     page = dio_get_page()
> >       dio_refill_pages()
> >         iov_iter_get_pages()
> > 	  get_user_pages_fast()
> >             - page fault
> >               ->page_mkwrite()
> >                 block_page_mkwrite()
> >                   lock_page(page);
> >                   - attaches buffers to page
> >                   - makes sure blocks are allocated
> >                   set_page_dirty(page)
> >               - install writeable PTE
> >               unlock_page(page);
> >     submit_page_section(page)
> >       - submits bio with 'page' as a buffer
> > 					kswapd reclaims pages:
> > 					...
> > 					shrink_page_list()
> > 					  trylock_page(page) - this is the
> > 					    page CPU1 has just faulted in
> > 					  try_to_unmap(page)
> > 					  pageout(page);
> > 					    clear_page_dirty_for_io(page);
> > 					    ->writepage()
> > 					  - let's assume page got written
> > 					    out fast enough, alternatively
> > 					    we could get to the same path as
> > 					    soon as the page IO completes
> > 					  if (page_has_private(page)) {
> > 					    try_to_release_page(page)
> > 					      - reclaims buffers from the
> > 					        page
> > 					   __remove_mapping(page)
> > 					     - fails as DIO code still
> > 					       holds page reference
> > ...
> > 
> > eventually read completes
> >   dio_bio_complete(bio)
> >     set_page_dirty_lock(page)
> >       Bummer, we've just marked the page as dirty without having buffers.
> >       Eventually writeback will find it and filesystem will complain...
> > 
> > Am I missing something?
> 
> My first question is why is kswapd trying to reclaim a page with an
> elevated active reference count? i.e. there are active references
> the VM *doesn't own* to the page, which means that there may well
> a user that expects the state on the page (e.g. the page private
> data that the active reference instantiated!) to remain intact until
> it drops it's active reference.

Page private data (and most of page state) is protected by a page lock, not
by a page reference. So reclaim (which is holding the page lock) is free to
try to reclaim page private data by calling ->releasepage callback.

That being said you are right that the attempt to reclaim a page with
active references is futile. But the problem is that we don't know how many
page references are actually left before we unmap the page from page tables
(each page table entry holds a page reference) and free page private data
(as that may hold page reference as well - e.g. attach_page_buffers()
acquires page reference). So checking page references in advance is
difficult.

Furthermore the core of the problem is not in the fact that page buffers
are reclaimed. That just makes it visible. The real problem is that page can
be written to by a GUP user while it is neither writeably mapped in page
tables nor prepared with ->write_begin. So a similar race violating
filesystem's assumptions can be like:

CPU1					CPU2

addr = mmap(file1, MAP_SHARED, ...);
fd2 = open(file2, O_DIRECT | O_RDONLY);
read(fd2, addr, len)
  do_direct_IO()
    ...
    page = get_user_pages_fast()
      - page fault handled
    submit_page_section(page)
      - submits bio with 'page' as a buffer
					ordinary writeback:
					writepages(file1)
					  clear_page_dirty_for_io(page)
					    - page gets writeprotected in
					      page tables
...
eventually read completes
  dio_bio_complete(bio)
    set_page_dirty_lock(page)

And a race like this is enough to cause data corruption if we are unlucky.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-01-04  8:52     ` Jan Kara
@ 2018-01-04 10:08       ` Dave Chinner
  -1 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2018-01-04 10:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, linux-fsdevel, linux-xfs, linux-ext4, Dan Williams

On Thu, Jan 04, 2018 at 09:52:44AM +0100, Jan Kara wrote:
> On Thu 04-01-18 16:59:19, Dave Chinner wrote:
> > On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > Over the years I have seen so far unexplained crashed in filesystem's
> > > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > > them (see [1] and [2] for relatively recent reports). This was confusing as
> > > reclaim takes care not to strip buffers from a dirty page and both
> > > filesystems do add buffers to a page when it is first written to - in
> > > ->page_mkwrite() and ->write_begin callbacks.
> > > 
> > > Recently I have come across a code path that is probably leading to this
> > > inconsistent state and I'd like to discuss how to best fix the problem
> > > because it's not obvious to me. Consider the following race:
> > > 
> > > CPU1					CPU2
> > > 
> > > addr = mmap(file1, MAP_SHARED, ...);
> > > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > > read(fd2, addr, len)
> > >   do_direct_IO()
> > >     page = dio_get_page()
> > >       dio_refill_pages()
> > >         iov_iter_get_pages()
> > > 	  get_user_pages_fast()
> > >             - page fault
> > >               ->page_mkwrite()
> > >                 block_page_mkwrite()
> > >                   lock_page(page);
> > >                   - attaches buffers to page
> > >                   - makes sure blocks are allocated
> > >                   set_page_dirty(page)
> > >               - install writeable PTE
> > >               unlock_page(page);
> > >     submit_page_section(page)
> > >       - submits bio with 'page' as a buffer
> > > 					kswapd reclaims pages:
> > > 					...
> > > 					shrink_page_list()
> > > 					  trylock_page(page) - this is the
> > > 					    page CPU1 has just faulted in
> > > 					  try_to_unmap(page)
> > > 					  pageout(page);
> > > 					    clear_page_dirty_for_io(page);
> > > 					    ->writepage()
> > > 					  - let's assume page got written
> > > 					    out fast enough, alternatively
> > > 					    we could get to the same path as
> > > 					    soon as the page IO completes
> > > 					  if (page_has_private(page)) {
> > > 					    try_to_release_page(page)
> > > 					      - reclaims buffers from the
> > > 					        page
> > > 					   __remove_mapping(page)
> > > 					     - fails as DIO code still
> > > 					       holds page reference
> > > ...
> > > 
> > > eventually read completes
> > >   dio_bio_complete(bio)
> > >     set_page_dirty_lock(page)
> > >       Bummer, we've just marked the page as dirty without having buffers.
> > >       Eventually writeback will find it and filesystem will complain...
> > > 
> > > Am I missing something?
> > 
> > My first question is why is kswapd trying to reclaim a page with an
> > elevated active reference count? i.e. there are active references
> > the VM *doesn't own* to the page, which means that there may well
> > a user that expects the state on the page (e.g. the page private
> > data that the active reference instantiated!) to remain intact until
> > it drops it's active reference.
> 
> Page private data (and most of page state) is protected by a page lock, not
> by a page reference. So reclaim (which is holding the page lock) is free to
> try to reclaim page private data by calling ->releasepage callback.

Page private data is "owned" by whoever put the private data there.
Manipulating the fields and state that says there is private data on
the page is protected by the page lock.

> That being said you are right that the attempt to reclaim a page with
> active references is futile. But the problem is that we don't know how many
> page references are actually left before we unmap the page from page tables
> (each page table entry holds a page reference) and free page private data
> (as that may hold page reference as well - e.g. attach_page_buffers()
> acquires page reference). So checking page references in advance is
> difficult.

perhaps we need separate accounting of internal and active
references (kinda like superblocks), where active references prevent
reclaim because they require the current state to be maintained
until the reference is dropped, whilst internal references simply
prevent the page from being freed until they are released.

> Furthermore the core of the problem is not in the fact that page buffers
> are reclaimed. That just makes it visible. The real problem is that page can
> be written to by a GUP user while it is neither writeably mapped in page
> tables nor prepared with ->write_begin. So a similar race violating
> filesystem's assumptions can be like:
> 
> CPU1					CPU2
> 
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     ...
>     page = get_user_pages_fast()
>       - page fault handled
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
> 					ordinary writeback:
> 					writepages(file1)
> 					  clear_page_dirty_for_io(page)
> 					    - page gets writeprotected in
> 					      page tables
> ...
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
> 
> And a race like this is enough to cause data corruption if we are unlucky.

Hmmm. if that's the case then we probably need a page flag to
indicate the page cannot be cleaned, unmapped or reclaimed by
anything until the GUP reference owner clears that flag. It seems
analagous to the PageWriteback flag and teh way we avoid certain
things when we know the page is under IO (e.g. immediate reclaim
:P).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
@ 2018-01-04 10:08       ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2018-01-04 10:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, linux-fsdevel, linux-xfs, linux-ext4, Dan Williams

On Thu, Jan 04, 2018 at 09:52:44AM +0100, Jan Kara wrote:
> On Thu 04-01-18 16:59:19, Dave Chinner wrote:
> > On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > Over the years I have seen so far unexplained crashed in filesystem's
> > > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > > them (see [1] and [2] for relatively recent reports). This was confusing as
> > > reclaim takes care not to strip buffers from a dirty page and both
> > > filesystems do add buffers to a page when it is first written to - in
> > > ->page_mkwrite() and ->write_begin callbacks.
> > > 
> > > Recently I have come across a code path that is probably leading to this
> > > inconsistent state and I'd like to discuss how to best fix the problem
> > > because it's not obvious to me. Consider the following race:
> > > 
> > > CPU1					CPU2
> > > 
> > > addr = mmap(file1, MAP_SHARED, ...);
> > > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > > read(fd2, addr, len)
> > >   do_direct_IO()
> > >     page = dio_get_page()
> > >       dio_refill_pages()
> > >         iov_iter_get_pages()
> > > 	  get_user_pages_fast()
> > >             - page fault
> > >               ->page_mkwrite()
> > >                 block_page_mkwrite()
> > >                   lock_page(page);
> > >                   - attaches buffers to page
> > >                   - makes sure blocks are allocated
> > >                   set_page_dirty(page)
> > >               - install writeable PTE
> > >               unlock_page(page);
> > >     submit_page_section(page)
> > >       - submits bio with 'page' as a buffer
> > > 					kswapd reclaims pages:
> > > 					...
> > > 					shrink_page_list()
> > > 					  trylock_page(page) - this is the
> > > 					    page CPU1 has just faulted in
> > > 					  try_to_unmap(page)
> > > 					  pageout(page);
> > > 					    clear_page_dirty_for_io(page);
> > > 					    ->writepage()
> > > 					  - let's assume page got written
> > > 					    out fast enough, alternatively
> > > 					    we could get to the same path as
> > > 					    soon as the page IO completes
> > > 					  if (page_has_private(page)) {
> > > 					    try_to_release_page(page)
> > > 					      - reclaims buffers from the
> > > 					        page
> > > 					   __remove_mapping(page)
> > > 					     - fails as DIO code still
> > > 					       holds page reference
> > > ...
> > > 
> > > eventually read completes
> > >   dio_bio_complete(bio)
> > >     set_page_dirty_lock(page)
> > >       Bummer, we've just marked the page as dirty without having buffers.
> > >       Eventually writeback will find it and filesystem will complain...
> > > 
> > > Am I missing something?
> > 
> > My first question is why is kswapd trying to reclaim a page with an
> > elevated active reference count? i.e. there are active references
> > the VM *doesn't own* to the page, which means that there may well
> > a user that expects the state on the page (e.g. the page private
> > data that the active reference instantiated!) to remain intact until
> > it drops it's active reference.
> 
> Page private data (and most of page state) is protected by a page lock, not
> by a page reference. So reclaim (which is holding the page lock) is free to
> try to reclaim page private data by calling ->releasepage callback.

Page private data is "owned" by whoever put the private data there.
Manipulating the fields and state that says there is private data on
the page is protected by the page lock.

> That being said you are right that the attempt to reclaim a page with
> active references is futile. But the problem is that we don't know how many
> page references are actually left before we unmap the page from page tables
> (each page table entry holds a page reference) and free page private data
> (as that may hold page reference as well - e.g. attach_page_buffers()
> acquires page reference). So checking page references in advance is
> difficult.

perhaps we need separate accounting of internal and active
references (kinda like superblocks), where active references prevent
reclaim because they require the current state to be maintained
until the reference is dropped, whilst internal references simply
prevent the page from being freed until they are released.

> Furthermore the core of the problem is not in the fact that page buffers
> are reclaimed. That just makes it visible. The real problem is that page can
> be written to by a GUP user while it is neither writeably mapped in page
> tables nor prepared with ->write_begin. So a similar race violating
> filesystem's assumptions can be like:
> 
> CPU1					CPU2
> 
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     ...
>     page = get_user_pages_fast()
>       - page fault handled
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
> 					ordinary writeback:
> 					writepages(file1)
> 					  clear_page_dirty_for_io(page)
> 					    - page gets writeprotected in
> 					      page tables
> ...
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
> 
> And a race like this is enough to cause data corruption if we are unlucky.

Hmmm. if that's the case then we probably need a page flag to
indicate the page cannot be cleaned, unmapped or reclaimed by
anything until the GUP reference owner clears that flag. It seems
analagous to the PageWriteback flag and teh way we avoid certain
things when we know the page is under IO (e.g. immediate reclaim
:P).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-01-04  4:56   ` Dan Williams
@ 2018-01-04 11:33     ` Jan Kara
  -1 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-01-04 11:33 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jan Kara, Linux MM, linux-fsdevel, linux-xfs, linux-ext4

On Wed 03-01-18 20:56:32, Dan Williams wrote:
> On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@suse.cz> wrote:
> > Hello,
> >
> > Over the years I have seen so far unexplained crashed in filesystem's
> > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > them (see [1] and [2] for relatively recent reports). This was confusing as
> > reclaim takes care not to strip buffers from a dirty page and both
> > filesystems do add buffers to a page when it is first written to - in
> > ->page_mkwrite() and ->write_begin callbacks.
> >
> > Recently I have come across a code path that is probably leading to this
> > inconsistent state and I'd like to discuss how to best fix the problem
> > because it's not obvious to me. Consider the following race:
> >
> > CPU1                                    CPU2
> >
> > addr = mmap(file1, MAP_SHARED, ...);
> > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > read(fd2, addr, len)
> >   do_direct_IO()
> >     page = dio_get_page()
> >       dio_refill_pages()
> >         iov_iter_get_pages()
> >           get_user_pages_fast()
> >             - page fault
> >               ->page_mkwrite()
> >                 block_page_mkwrite()
> >                   lock_page(page);
> >                   - attaches buffers to page
> >                   - makes sure blocks are allocated
> >                   set_page_dirty(page)
> >               - install writeable PTE
> >               unlock_page(page);
> >     submit_page_section(page)
> >       - submits bio with 'page' as a buffer
> >                                         kswapd reclaims pages:
> >                                         ...
> >                                         shrink_page_list()
> >                                           trylock_page(page) - this is the
> >                                             page CPU1 has just faulted in
> >                                           try_to_unmap(page)
> >                                           pageout(page);
> >                                             clear_page_dirty_for_io(page);
> >                                             ->writepage()
> >                                           - let's assume page got written
> >                                             out fast enough, alternatively
> >                                             we could get to the same path as
> >                                             soon as the page IO completes
> >                                           if (page_has_private(page)) {
> >                                             try_to_release_page(page)
> >                                               - reclaims buffers from the
> >                                                 page
> >                                            __remove_mapping(page)
> >                                              - fails as DIO code still
> >                                                holds page reference
> > ...
> >
> > eventually read completes
> >   dio_bio_complete(bio)
> >     set_page_dirty_lock(page)
> >       Bummer, we've just marked the page as dirty without having buffers.
> >       Eventually writeback will find it and filesystem will complain...
> >
> > Am I missing something?
> >
> > The problem here is that filesystems fundamentally assume that a page can
> > be written to only between ->write_begin - ->write_end (in this interval
> > the page is locked), or between ->page_mkwrite - ->writepage and above is
> > an example where this does not hold because when a page reference is
> > acquired through get_user_pages(), page can get written to by the holder of
> > the reference and dirtied even after it has been unmapped from page tables
> > and ->writepage has been called. This is not only a cosmetic issue leading
> > to assertion failure but it can also lead to data loss, data corruption, or
> > other unpleasant surprises as filesystems assume page contents cannot be
> > modified until either ->write_begin() or ->page_mkwrite gets called and
> > those calls are serialized by proper locking with problematic operations
> > such as hole punching etc.
> >
> > I'm not sure how to fix this problem. We could 'simulate' a writeable page
> > fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> > virtual address of the fault, don't hold mmap_sem, etc., possibly
> > expensive, but it would make filesystems happy. Data stored by GUP user
> > (e.g. read by DIO in the above case) could still get lost if someone e.g.
> > punched hole under the buffer or otherwise messed with the underlying
> > storage of the page while DIO was running but arguably users could expect
> > such outcome.
> >
> > Another possible solution would be to make sure page is writeably mapped
> > until GUP user drops its reference. That would be arguably cleaner but
> > probably that would mean we have to track number of writeable GUP page
> > references separately (no space space in struct page is a problem here) and
> > block page_mkclean() until they are dropped. Also for long term GUP users
> > like Infiniband or V4L we'd have to come up with some solution as we should
> > not block page_mkclean() for so long.
> 
> Do we need to block page_mkclean, or could we defer buffer reclaiming
> to the last put of the page?

As I wrote to Dave the problem is no so much with reclaiming of buffers but
with the fact filesystems don't expect page can be dirtied after
page_mkclean() is finished.

> I think once we have the "register memory with lease" mechanism for
> Infiniband we could expand it to the page cache case. The problem is
> the regression this would cause with userspace that expects it can
> maintain file backed memory registrations indefinitely.
> 
> What are the implications of holding off page_mkclean or release
> buffers indefinitely?

Bad. You cannot write the page to disk until page_mkclean() finishes as
page_mkclean() is part of clear_page_dirty_for_io(). And we really do need
that functionality there e.g. to make sure tail of the last page in the
file is properly zeroed out, storage with DIF/DIX can compute checksum of
the data safely before submitting it to the device etc.

> Is an indefinite / interruptible sleep waiting for the 'put' event of
> a get_user_pages() page unacceptable? The current case that the file
> contents will not be coherent with respect to in-flight RDMA, perhaps
> waiting for that to complete is better than cleaning buffers from the
> page prematurely.

Yeah, indefinite sleep is really a no-go.

> > As a side note DAX needs some solution for GUP users as well. The problems
> > are similar there in nature, just much easier to hit. So at least a
> > solution for long-term GUP users can (and I strongly believe should) be
> > shared between standard and DAX paths.
> 
> In the DAX case we rely on the fact that when the page goes idle we
> only need to worry about the filesytem block map changing, the page
> won't get reallocated somewhere else. We can't use page idle as an
> event in this case, however, if the page reference count is one then
> the DIO code can know that it has the page exclusively, so maybe DAX
> and non-DAX can share the page count == 1 event notification.

The races I describe do not need to involve truncate / hole punching. It is
just enough to race with page writeback. So page references are of no use
here. We would have to specifically track number of references acquired by
GUP or something like that. So what I wanted to share with DAX is the
long-term pin handling, the rest is unclear for now.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
@ 2018-01-04 11:33     ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2018-01-04 11:33 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jan Kara, Linux MM, linux-fsdevel, linux-xfs, linux-ext4

On Wed 03-01-18 20:56:32, Dan Williams wrote:
> On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@suse.cz> wrote:
> > Hello,
> >
> > Over the years I have seen so far unexplained crashed in filesystem's
> > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > them (see [1] and [2] for relatively recent reports). This was confusing as
> > reclaim takes care not to strip buffers from a dirty page and both
> > filesystems do add buffers to a page when it is first written to - in
> > ->page_mkwrite() and ->write_begin callbacks.
> >
> > Recently I have come across a code path that is probably leading to this
> > inconsistent state and I'd like to discuss how to best fix the problem
> > because it's not obvious to me. Consider the following race:
> >
> > CPU1                                    CPU2
> >
> > addr = mmap(file1, MAP_SHARED, ...);
> > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > read(fd2, addr, len)
> >   do_direct_IO()
> >     page = dio_get_page()
> >       dio_refill_pages()
> >         iov_iter_get_pages()
> >           get_user_pages_fast()
> >             - page fault
> >               ->page_mkwrite()
> >                 block_page_mkwrite()
> >                   lock_page(page);
> >                   - attaches buffers to page
> >                   - makes sure blocks are allocated
> >                   set_page_dirty(page)
> >               - install writeable PTE
> >               unlock_page(page);
> >     submit_page_section(page)
> >       - submits bio with 'page' as a buffer
> >                                         kswapd reclaims pages:
> >                                         ...
> >                                         shrink_page_list()
> >                                           trylock_page(page) - this is the
> >                                             page CPU1 has just faulted in
> >                                           try_to_unmap(page)
> >                                           pageout(page);
> >                                             clear_page_dirty_for_io(page);
> >                                             ->writepage()
> >                                           - let's assume page got written
> >                                             out fast enough, alternatively
> >                                             we could get to the same path as
> >                                             soon as the page IO completes
> >                                           if (page_has_private(page)) {
> >                                             try_to_release_page(page)
> >                                               - reclaims buffers from the
> >                                                 page
> >                                            __remove_mapping(page)
> >                                              - fails as DIO code still
> >                                                holds page reference
> > ...
> >
> > eventually read completes
> >   dio_bio_complete(bio)
> >     set_page_dirty_lock(page)
> >       Bummer, we've just marked the page as dirty without having buffers.
> >       Eventually writeback will find it and filesystem will complain...
> >
> > Am I missing something?
> >
> > The problem here is that filesystems fundamentally assume that a page can
> > be written to only between ->write_begin - ->write_end (in this interval
> > the page is locked), or between ->page_mkwrite - ->writepage and above is
> > an example where this does not hold because when a page reference is
> > acquired through get_user_pages(), page can get written to by the holder of
> > the reference and dirtied even after it has been unmapped from page tables
> > and ->writepage has been called. This is not only a cosmetic issue leading
> > to assertion failure but it can also lead to data loss, data corruption, or
> > other unpleasant surprises as filesystems assume page contents cannot be
> > modified until either ->write_begin() or ->page_mkwrite gets called and
> > those calls are serialized by proper locking with problematic operations
> > such as hole punching etc.
> >
> > I'm not sure how to fix this problem. We could 'simulate' a writeable page
> > fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> > virtual address of the fault, don't hold mmap_sem, etc., possibly
> > expensive, but it would make filesystems happy. Data stored by GUP user
> > (e.g. read by DIO in the above case) could still get lost if someone e.g.
> > punched hole under the buffer or otherwise messed with the underlying
> > storage of the page while DIO was running but arguably users could expect
> > such outcome.
> >
> > Another possible solution would be to make sure page is writeably mapped
> > until GUP user drops its reference. That would be arguably cleaner but
> > probably that would mean we have to track number of writeable GUP page
> > references separately (no space space in struct page is a problem here) and
> > block page_mkclean() until they are dropped. Also for long term GUP users
> > like Infiniband or V4L we'd have to come up with some solution as we should
> > not block page_mkclean() for so long.
> 
> Do we need to block page_mkclean, or could we defer buffer reclaiming
> to the last put of the page?

As I wrote to Dave the problem is no so much with reclaiming of buffers but
with the fact filesystems don't expect page can be dirtied after
page_mkclean() is finished.

> I think once we have the "register memory with lease" mechanism for
> Infiniband we could expand it to the page cache case. The problem is
> the regression this would cause with userspace that expects it can
> maintain file backed memory registrations indefinitely.
> 
> What are the implications of holding off page_mkclean or release
> buffers indefinitely?

Bad. You cannot write the page to disk until page_mkclean() finishes as
page_mkclean() is part of clear_page_dirty_for_io(). And we really do need
that functionality there e.g. to make sure tail of the last page in the
file is properly zeroed out, storage with DIF/DIX can compute checksum of
the data safely before submitting it to the device etc.

> Is an indefinite / interruptible sleep waiting for the 'put' event of
> a get_user_pages() page unacceptable? The current case that the file
> contents will not be coherent with respect to in-flight RDMA, perhaps
> waiting for that to complete is better than cleaning buffers from the
> page prematurely.

Yeah, indefinite sleep is really a no-go.

> > As a side note DAX needs some solution for GUP users as well. The problems
> > are similar there in nature, just much easier to hit. So at least a
> > solution for long-term GUP users can (and I strongly believe should) be
> > shared between standard and DAX paths.
> 
> In the DAX case we rely on the fact that when the page goes idle we
> only need to worry about the filesytem block map changing, the page
> won't get reallocated somewhere else. We can't use page idle as an
> event in this case, however, if the page reference count is one then
> the DIO code can know that it has the page exclusively, so maybe DAX
> and non-DAX can share the page count == 1 event notification.

The races I describe do not need to involve truncate / hole punching. It is
just enough to race with page writeback. So page references are of no use
here. We would have to specifically track number of references acquired by
GUP or something like that. So what I wanted to share with DAX is the
long-term pin handling, the rest is unclear for now.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-01-04 11:33     ` Jan Kara
  (?)
@ 2018-04-13 12:39     ` Gavin Guo
  2018-04-13 12:58       ` Jan Kara
  -1 siblings, 1 reply; 16+ messages in thread
From: Gavin Guo @ 2018-04-13 12:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dan Williams, Linux MM, linux-fsdevel, linux-xfs, linux-ext4

Hi all,

On Thu, Jan 4, 2018 at 7:33 PM, Jan Kara <jack@suse.cz> wrote:
>
> On Wed 03-01-18 20:56:32, Dan Williams wrote:
> > On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@suse.cz> wrote:
> > > Hello,
> > >
> > > Over the years I have seen so far unexplained crashed in filesystem's
> > > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > > them (see [1] and [2] for relatively recent reports). This was confusing as
> > > reclaim takes care not to strip buffers from a dirty page and both
> > > filesystems do add buffers to a page when it is first written to - in
> > > ->page_mkwrite() and ->write_begin callbacks.
> > >
> > > Recently I have come across a code path that is probably leading to this
> > > inconsistent state and I'd like to discuss how to best fix the problem
> > > because it's not obvious to me. Consider the following race:
> > >
> > > CPU1                                    CPU2
> > >
> > > addr = mmap(file1, MAP_SHARED, ...);
> > > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > > read(fd2, addr, len)
> > >   do_direct_IO()
> > >     page = dio_get_page()
> > >       dio_refill_pages()
> > >         iov_iter_get_pages()
> > >           get_user_pages_fast()
> > >             - page fault
> > >               ->page_mkwrite()
> > >                 block_page_mkwrite()
> > >                   lock_page(page);
> > >                   - attaches buffers to page
> > >                   - makes sure blocks are allocated
> > >                   set_page_dirty(page)
> > >               - install writeable PTE
> > >               unlock_page(page);
> > >     submit_page_section(page)
> > >       - submits bio with 'page' as a buffer
> > >                                         kswapd reclaims pages:
> > >                                         ...
> > >                                         shrink_page_list()
> > >                                           trylock_page(page) - this is the
> > >                                             page CPU1 has just faulted in
> > >                                           try_to_unmap(page)
> > >                                           pageout(page);
> > >                                             clear_page_dirty_for_io(page);
> > >                                             ->writepage()
> > >                                           - let's assume page got written
> > >                                             out fast enough, alternatively
> > >                                             we could get to the same path as
> > >                                             soon as the page IO completes
> > >                                           if (page_has_private(page)) {
> > >                                             try_to_release_page(page)
> > >                                               - reclaims buffers from the
> > >                                                 page
> > >                                            __remove_mapping(page)
> > >                                              - fails as DIO code still
> > >                                                holds page reference
> > > ...
> > >
> > > eventually read completes
> > >   dio_bio_complete(bio)
> > >     set_page_dirty_lock(page)
> > >       Bummer, we've just marked the page as dirty without having buffers.
> > >       Eventually writeback will find it and filesystem will complain...
> > >
> > > Am I missing something?
> > >
> > > The problem here is that filesystems fundamentally assume that a page can
> > > be written to only between ->write_begin - ->write_end (in this interval
> > > the page is locked), or between ->page_mkwrite - ->writepage and above is
> > > an example where this does not hold because when a page reference is
> > > acquired through get_user_pages(), page can get written to by the holder of
> > > the reference and dirtied even after it has been unmapped from page tables
> > > and ->writepage has been called. This is not only a cosmetic issue leading
> > > to assertion failure but it can also lead to data loss, data corruption, or
> > > other unpleasant surprises as filesystems assume page contents cannot be
> > > modified until either ->write_begin() or ->page_mkwrite gets called and
> > > those calls are serialized by proper locking with problematic operations
> > > such as hole punching etc.
> > >
> > > I'm not sure how to fix this problem. We could 'simulate' a writeable page
> > > fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> > > virtual address of the fault, don't hold mmap_sem, etc., possibly
> > > expensive, but it would make filesystems happy. Data stored by GUP user
> > > (e.g. read by DIO in the above case) could still get lost if someone e.g.
> > > punched hole under the buffer or otherwise messed with the underlying
> > > storage of the page while DIO was running but arguably users could expect
> > > such outcome.
> > >
> > > Another possible solution would be to make sure page is writeably mapped
> > > until GUP user drops its reference. That would be arguably cleaner but
> > > probably that would mean we have to track number of writeable GUP page
> > > references separately (no space space in struct page is a problem here) and
> > > block page_mkclean() until they are dropped. Also for long term GUP users
> > > like Infiniband or V4L we'd have to come up with some solution as we should
> > > not block page_mkclean() for so long.
> >
> > Do we need to block page_mkclean, or could we defer buffer reclaiming
> > to the last put of the page?
>
> As I wrote to Dave the problem is no so much with reclaiming of buffers but
> with the fact filesystems don't expect page can be dirtied after
> page_mkclean() is finished.
>
> > I think once we have the "register memory with lease" mechanism for
> > Infiniband we could expand it to the page cache case. The problem is
> > the regression this would cause with userspace that expects it can
> > maintain file backed memory registrations indefinitely.
> >
> > What are the implications of holding off page_mkclean or release
> > buffers indefinitely?
>
> Bad. You cannot write the page to disk until page_mkclean() finishes as
> page_mkclean() is part of clear_page_dirty_for_io(). And we really do need
> that functionality there e.g. to make sure tail of the last page in the
> file is properly zeroed out, storage with DIF/DIX can compute checksum of
> the data safely before submitting it to the device etc.
>
> > Is an indefinite / interruptible sleep waiting for the 'put' event of
> > a get_user_pages() page unacceptable? The current case that the file
> > contents will not be coherent with respect to in-flight RDMA, perhaps
> > waiting for that to complete is better than cleaning buffers from the
> > page prematurely.
>
> Yeah, indefinite sleep is really a no-go.
>
> > > As a side note DAX needs some solution for GUP users as well. The problems
> > > are similar there in nature, just much easier to hit. So at least a
> > > solution for long-term GUP users can (and I strongly believe should) be
> > > shared between standard and DAX paths.
> >
> > In the DAX case we rely on the fact that when the page goes idle we
> > only need to worry about the filesytem block map changing, the page
> > won't get reallocated somewhere else. We can't use page idle as an
> > event in this case, however, if the page reference count is one then
> > the DIO code can know that it has the page exclusively, so maybe DAX
> > and non-DAX can share the page count == 1 event notification.
>
> The races I describe do not need to involve truncate / hole punching. It is
> just enough to race with page writeback. So page references are of no use
> here. We would have to specifically track number of references acquired by
> GUP or something like that. So what I wanted to share with DAX is the
> long-term pin handling, the rest is unclear for now.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

The bug can be reliably reproduced in our platform with the
current upstream kernel(80aa76bcd364 Merge tag 'xfs-4.17-merge-4' of
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux). I'm happy to
help to test and debug. The error message is as following:

kernel BUG at /home/gavin/work-kernel/fs/ext4/inode.c:2126!
invalid opcode: 0000 [#1] SMP PTI
Modules linked in: veth ipt_MASQUERADE nf_nat_masquerade_ipv4
nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat
nf_conntrack br_netfilter bridge stp llc overlay xt_multiport
iptable_filter ip_tables x_tables cachefiles fscache esp6_offload esp6
esp4_offload esp4 xfrm_algo nls_iso8859_1 intel_rapl sb_edac
x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel
nvidia_uvm(POE) mxm_wmi kvm joydev input_leds irqbypass intel_cstate
intel_rapl_perf mei_me ipmi_si shpchp lpc_ich mei acpi_power_meter
mac_hid wmi ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
libiscsi scsi_transport_iscsi ipmi_devintf sunrpc ipmi_msghandler
autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov
async_memcpy
 async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
multipath linear i2c_algo_bit nvidia_drm(POE) ses crct10dif_pclmul
crc32_pclmul nvidia_modeset(POE) ttm ghash_clmulni_intel enclosure
hid_generic uas pcbc scsi_transport_sas drm_kms_helper usbhid
aesni_intel hid aes_x86_64 usb_storage syscopyarea crypto_simd
sysfillrect mlx5_core cryptd nvidia(POE) sysimgblt glue_helper ixgbe
mlxfw megaraid_sas fb_sys_fops devlink dca ahci ptp drm libahci
pps_core mdio
CPU: 54 PID: 8938 Comm: kworker/u161:0 Tainted: P           OE
4.16.0-999-generic #201804102200

Workqueue: writeback wb_workfn (flush-8:0)
RIP: 0010:ext4_writepage+0x318/0x770
RSP: 0018:ffffb514e76cb7f8 EFLAGS: 00010246
RAX: 00500b4e8000026d RBX: 0000000000001000 RCX: ffff8c2cafba5000
RDX: ffff8bec4069a020 RSI: ffffb514e76cbc28 RDI: ffffe91efcc8bd80
RBP: ffffb514e76cb870 R08: 0000000000028115 R09: 00000000000280c0
R10: 0000000000000002 R11: ffff8c2dbffd4000 R12: ffffe91efcc8bd80
R13: ffff8bec40699ea8 R14: ffffb514e76cbc28 R15: ffffe91efcc8bd80
FS:  0000000000000000(0000) GS:ffff8becbfc80000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffcc8f6f080 CR3: 0000004b9ce0a006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? rmap_walk+0x41/0x60
 ? page_mkclean+0x9f/0xb0
 ? invalid_page_referenced_vma+0x80/0x80
 __writepage+0x17/0x50
 write_cache_pages+0x228/0x4a0
 ? __wb_calc_thresh+0x140/0x140
 generic_writepages+0x61/0xa0
 ? _cond_resched+0x1a/0x50
 ? write_cache_pages+0x396/0x4a0
 ext4_writepages+0x1fc/0xe00
 ? ext4_writepages+0x1fc/0xe00
 ? generic_writepages+0x6d/0xa0
 ? fprop_fraction_percpu+0x2f/0x80
 do_writepages+0x1c/0x60
 ? do_writepages+0x1c/0x60
 __writeback_single_inode+0x45/0x320
 writeback_sb_inodes+0x266/0x580
 __writeback_inodes_wb+0x92/0xc0
 wb_writeback+0x282/0x310
 wb_workfn+0x1a3/0x440
 ? wb_workfn+0x1a3/0x440
 process_one_work+0x1db/0x3c0
 worker_thread+0x4b/0x420
 kthread+0x102/0x140
 ? rescuer_thread+0x380/0x380
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x35/0x40
Code: ff f6 c4 08 0f 85 58 ff ff ff e8 68 56 00 00 ba 00 10 00 00 31
f6 41 bd fb ff ff ff e8 a2 9e ff ff 4c 89 e7 e8 fa 39 ea ff eb 8a <0f>
0b c6 45 a8 00 e9 f0 fd ff ff 49 83 7c 24 10 00 0f 85 c7 03
RIP: ext4_writepage+0x318/0x770 RSP: ffffb514e76cb7f8
---[ end trace 59d4e1a4b221404b ]---

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-04-13 12:39     ` Gavin Guo
@ 2018-04-13 12:58       ` Jan Kara
  2018-04-16  9:20         ` Gavin Guo
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2018-04-13 12:58 UTC (permalink / raw)
  To: Gavin Guo
  Cc: Jan Kara, Dan Williams, Linux MM, linux-fsdevel, linux-xfs, linux-ext4

Hi!

On Fri 13-04-18 20:39:06, Gavin Guo wrote:
> On Thu, Jan 4, 2018 at 7:33 PM, Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 03-01-18 20:56:32, Dan Williams wrote:
> > > On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@suse.cz> wrote:
> > > > Hello,
> > > >
> > > > Over the years I have seen so far unexplained crashed in filesystem's
> > > > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > > > them (see [1] and [2] for relatively recent reports). This was confusing as
> > > > reclaim takes care not to strip buffers from a dirty page and both
> > > > filesystems do add buffers to a page when it is first written to - in
> > > > ->page_mkwrite() and ->write_begin callbacks.
> > > >
> > > > Recently I have come across a code path that is probably leading to this
> > > > inconsistent state and I'd like to discuss how to best fix the problem
> > > > because it's not obvious to me. Consider the following race:
> > > >
> > > > CPU1                                    CPU2
> > > >
> > > > addr = mmap(file1, MAP_SHARED, ...);
> > > > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > > > read(fd2, addr, len)
> > > >   do_direct_IO()
> > > >     page = dio_get_page()
> > > >       dio_refill_pages()
> > > >         iov_iter_get_pages()
> > > >           get_user_pages_fast()
> > > >             - page fault
> > > >               ->page_mkwrite()
> > > >                 block_page_mkwrite()
> > > >                   lock_page(page);
> > > >                   - attaches buffers to page
> > > >                   - makes sure blocks are allocated
> > > >                   set_page_dirty(page)
> > > >               - install writeable PTE
> > > >               unlock_page(page);
> > > >     submit_page_section(page)
> > > >       - submits bio with 'page' as a buffer
> > > >                                         kswapd reclaims pages:
> > > >                                         ...
> > > >                                         shrink_page_list()
> > > >                                           trylock_page(page) - this is the
> > > >                                             page CPU1 has just faulted in
> > > >                                           try_to_unmap(page)
> > > >                                           pageout(page);
> > > >                                             clear_page_dirty_for_io(page);
> > > >                                             ->writepage()
> > > >                                           - let's assume page got written
> > > >                                             out fast enough, alternatively
> > > >                                             we could get to the same path as
> > > >                                             soon as the page IO completes
> > > >                                           if (page_has_private(page)) {
> > > >                                             try_to_release_page(page)
> > > >                                               - reclaims buffers from the
> > > >                                                 page
> > > >                                            __remove_mapping(page)
> > > >                                              - fails as DIO code still
> > > >                                                holds page reference
> > > > ...
> > > >
> > > > eventually read completes
> > > >   dio_bio_complete(bio)
> > > >     set_page_dirty_lock(page)
> > > >       Bummer, we've just marked the page as dirty without having buffers.
> > > >       Eventually writeback will find it and filesystem will complain...
> > > >
> > > > Am I missing something?
> > > >
> > > > The problem here is that filesystems fundamentally assume that a page can
> > > > be written to only between ->write_begin - ->write_end (in this interval
> > > > the page is locked), or between ->page_mkwrite - ->writepage and above is
> > > > an example where this does not hold because when a page reference is
> > > > acquired through get_user_pages(), page can get written to by the holder of
> > > > the reference and dirtied even after it has been unmapped from page tables
> > > > and ->writepage has been called. This is not only a cosmetic issue leading
> > > > to assertion failure but it can also lead to data loss, data corruption, or
> > > > other unpleasant surprises as filesystems assume page contents cannot be
> > > > modified until either ->write_begin() or ->page_mkwrite gets called and
> > > > those calls are serialized by proper locking with problematic operations
> > > > such as hole punching etc.
> > > >
> > > > I'm not sure how to fix this problem. We could 'simulate' a writeable page
> > > > fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> > > > virtual address of the fault, don't hold mmap_sem, etc., possibly
> > > > expensive, but it would make filesystems happy. Data stored by GUP user
> > > > (e.g. read by DIO in the above case) could still get lost if someone e.g.
> > > > punched hole under the buffer or otherwise messed with the underlying
> > > > storage of the page while DIO was running but arguably users could expect
> > > > such outcome.
> > > >
> > > > Another possible solution would be to make sure page is writeably mapped
> > > > until GUP user drops its reference. That would be arguably cleaner but
> > > > probably that would mean we have to track number of writeable GUP page
> > > > references separately (no space space in struct page is a problem here) and
> > > > block page_mkclean() until they are dropped. Also for long term GUP users
> > > > like Infiniband or V4L we'd have to come up with some solution as we should
> > > > not block page_mkclean() for so long.
> > >
> > > Do we need to block page_mkclean, or could we defer buffer reclaiming
> > > to the last put of the page?
> >
> > As I wrote to Dave the problem is no so much with reclaiming of buffers but
> > with the fact filesystems don't expect page can be dirtied after
> > page_mkclean() is finished.
> >
> > > I think once we have the "register memory with lease" mechanism for
> > > Infiniband we could expand it to the page cache case. The problem is
> > > the regression this would cause with userspace that expects it can
> > > maintain file backed memory registrations indefinitely.
> > >
> > > What are the implications of holding off page_mkclean or release
> > > buffers indefinitely?
> >
> > Bad. You cannot write the page to disk until page_mkclean() finishes as
> > page_mkclean() is part of clear_page_dirty_for_io(). And we really do need
> > that functionality there e.g. to make sure tail of the last page in the
> > file is properly zeroed out, storage with DIF/DIX can compute checksum of
> > the data safely before submitting it to the device etc.
> >
> > > Is an indefinite / interruptible sleep waiting for the 'put' event of
> > > a get_user_pages() page unacceptable? The current case that the file
> > > contents will not be coherent with respect to in-flight RDMA, perhaps
> > > waiting for that to complete is better than cleaning buffers from the
> > > page prematurely.
> >
> > Yeah, indefinite sleep is really a no-go.
> >
> > > > As a side note DAX needs some solution for GUP users as well. The problems
> > > > are similar there in nature, just much easier to hit. So at least a
> > > > solution for long-term GUP users can (and I strongly believe should) be
> > > > shared between standard and DAX paths.
> > >
> > > In the DAX case we rely on the fact that when the page goes idle we
> > > only need to worry about the filesytem block map changing, the page
> > > won't get reallocated somewhere else. We can't use page idle as an
> > > event in this case, however, if the page reference count is one then
> > > the DIO code can know that it has the page exclusively, so maybe DAX
> > > and non-DAX can share the page count == 1 event notification.
> >
> > The races I describe do not need to involve truncate / hole punching. It is
> > just enough to race with page writeback. So page references are of no use
> > here. We would have to specifically track number of references acquired by
> > GUP or something like that. So what I wanted to share with DAX is the
> > long-term pin handling, the rest is unclear for now.
> >
> >                                                                 Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> The bug can be reliably reproduced in our platform with the
> current upstream kernel(80aa76bcd364 Merge tag 'xfs-4.17-merge-4' of
> git://git.kernel.org/pub/scm/fs/xfs/xfs-linux). I'm happy to
> help to test and debug. The error message is as following:

Thanks for report! So what workload do you run to trigger this? Is it just
a direct IO read to a buffer in a shared file mapping or something else?

								Honza

> kernel BUG at /home/gavin/work-kernel/fs/ext4/inode.c:2126!
> invalid opcode: 0000 [#1] SMP PTI
> Modules linked in: veth ipt_MASQUERADE nf_nat_masquerade_ipv4
> nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat
> nf_conntrack br_netfilter bridge stp llc overlay xt_multiport
> iptable_filter ip_tables x_tables cachefiles fscache esp6_offload esp6
> esp4_offload esp4 xfrm_algo nls_iso8859_1 intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel
> nvidia_uvm(POE) mxm_wmi kvm joydev input_leds irqbypass intel_cstate
> intel_rapl_perf mei_me ipmi_si shpchp lpc_ich mei acpi_power_meter
> mac_hid wmi ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
> libiscsi scsi_transport_iscsi ipmi_devintf sunrpc ipmi_msghandler
> autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov
> async_memcpy
>  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
> multipath linear i2c_algo_bit nvidia_drm(POE) ses crct10dif_pclmul
> crc32_pclmul nvidia_modeset(POE) ttm ghash_clmulni_intel enclosure
> hid_generic uas pcbc scsi_transport_sas drm_kms_helper usbhid
> aesni_intel hid aes_x86_64 usb_storage syscopyarea crypto_simd
> sysfillrect mlx5_core cryptd nvidia(POE) sysimgblt glue_helper ixgbe
> mlxfw megaraid_sas fb_sys_fops devlink dca ahci ptp drm libahci
> pps_core mdio
> CPU: 54 PID: 8938 Comm: kworker/u161:0 Tainted: P           OE
> 4.16.0-999-generic #201804102200
> 
> Workqueue: writeback wb_workfn (flush-8:0)
> RIP: 0010:ext4_writepage+0x318/0x770
> RSP: 0018:ffffb514e76cb7f8 EFLAGS: 00010246
> RAX: 00500b4e8000026d RBX: 0000000000001000 RCX: ffff8c2cafba5000
> RDX: ffff8bec4069a020 RSI: ffffb514e76cbc28 RDI: ffffe91efcc8bd80
> RBP: ffffb514e76cb870 R08: 0000000000028115 R09: 00000000000280c0
> R10: 0000000000000002 R11: ffff8c2dbffd4000 R12: ffffe91efcc8bd80
> R13: ffff8bec40699ea8 R14: ffffb514e76cbc28 R15: ffffe91efcc8bd80
> FS:  0000000000000000(0000) GS:ffff8becbfc80000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffcc8f6f080 CR3: 0000004b9ce0a006 CR4: 00000000003606e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  ? rmap_walk+0x41/0x60
>  ? page_mkclean+0x9f/0xb0
>  ? invalid_page_referenced_vma+0x80/0x80
>  __writepage+0x17/0x50
>  write_cache_pages+0x228/0x4a0
>  ? __wb_calc_thresh+0x140/0x140
>  generic_writepages+0x61/0xa0
>  ? _cond_resched+0x1a/0x50
>  ? write_cache_pages+0x396/0x4a0
>  ext4_writepages+0x1fc/0xe00
>  ? ext4_writepages+0x1fc/0xe00
>  ? generic_writepages+0x6d/0xa0
>  ? fprop_fraction_percpu+0x2f/0x80
>  do_writepages+0x1c/0x60
>  ? do_writepages+0x1c/0x60
>  __writeback_single_inode+0x45/0x320
>  writeback_sb_inodes+0x266/0x580
>  __writeback_inodes_wb+0x92/0xc0
>  wb_writeback+0x282/0x310
>  wb_workfn+0x1a3/0x440
>  ? wb_workfn+0x1a3/0x440
>  process_one_work+0x1db/0x3c0
>  worker_thread+0x4b/0x420
>  kthread+0x102/0x140
>  ? rescuer_thread+0x380/0x380
>  ? kthread_create_worker_on_cpu+0x70/0x70
>  ret_from_fork+0x35/0x40
> Code: ff f6 c4 08 0f 85 58 ff ff ff e8 68 56 00 00 ba 00 10 00 00 31
> f6 41 bd fb ff ff ff e8 a2 9e ff ff 4c 89 e7 e8 fa 39 ea ff eb 8a <0f>
> 0b c6 45 a8 00 e9 f0 fd ff ff 49 83 7c 24 10 00 0f 85 c7 03
> RIP: ext4_writepage+0x318/0x770 RSP: ffffb514e76cb7f8
> ---[ end trace 59d4e1a4b221404b ]---
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Filesystem crashes due to pages without buffers
  2018-04-13 12:58       ` Jan Kara
@ 2018-04-16  9:20         ` Gavin Guo
  0 siblings, 0 replies; 16+ messages in thread
From: Gavin Guo @ 2018-04-16  9:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dan Williams, Linux MM, linux-fsdevel, linux-xfs, linux-ext4

On Fri, Apr 13, 2018 at 8:58 PM, Jan Kara <jack@suse.cz> wrote:
> Hi!
>
> On Fri 13-04-18 20:39:06, Gavin Guo wrote:
>> On Thu, Jan 4, 2018 at 7:33 PM, Jan Kara <jack@suse.cz> wrote:
>> >
>> > On Wed 03-01-18 20:56:32, Dan Williams wrote:
>> > > On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@suse.cz> wrote:
>> > > > Hello,
>> > > >
>> > > > Over the years I have seen so far unexplained crashed in filesystem's
>> > > > (ext4, xfs) writeback path due to dirty pages without buffers attached to
>> > > > them (see [1] and [2] for relatively recent reports). This was confusing as
>> > > > reclaim takes care not to strip buffers from a dirty page and both
>> > > > filesystems do add buffers to a page when it is first written to - in
>> > > > ->page_mkwrite() and ->write_begin callbacks.
>> > > >
>> > > > Recently I have come across a code path that is probably leading to this
>> > > > inconsistent state and I'd like to discuss how to best fix the problem
>> > > > because it's not obvious to me. Consider the following race:
>> > > >
>> > > > CPU1                                    CPU2
>> > > >
>> > > > addr = mmap(file1, MAP_SHARED, ...);
>> > > > fd2 = open(file2, O_DIRECT | O_RDONLY);
>> > > > read(fd2, addr, len)
>> > > >   do_direct_IO()
>> > > >     page = dio_get_page()
>> > > >       dio_refill_pages()
>> > > >         iov_iter_get_pages()
>> > > >           get_user_pages_fast()
>> > > >             - page fault
>> > > >               ->page_mkwrite()
>> > > >                 block_page_mkwrite()
>> > > >                   lock_page(page);
>> > > >                   - attaches buffers to page
>> > > >                   - makes sure blocks are allocated
>> > > >                   set_page_dirty(page)
>> > > >               - install writeable PTE
>> > > >               unlock_page(page);
>> > > >     submit_page_section(page)
>> > > >       - submits bio with 'page' as a buffer
>> > > >                                         kswapd reclaims pages:
>> > > >                                         ...
>> > > >                                         shrink_page_list()
>> > > >                                           trylock_page(page) - this is the
>> > > >                                             page CPU1 has just faulted in
>> > > >                                           try_to_unmap(page)
>> > > >                                           pageout(page);
>> > > >                                             clear_page_dirty_for_io(page);
>> > > >                                             ->writepage()
>> > > >                                           - let's assume page got written
>> > > >                                             out fast enough, alternatively
>> > > >                                             we could get to the same path as
>> > > >                                             soon as the page IO completes
>> > > >                                           if (page_has_private(page)) {
>> > > >                                             try_to_release_page(page)
>> > > >                                               - reclaims buffers from the
>> > > >                                                 page
>> > > >                                            __remove_mapping(page)
>> > > >                                              - fails as DIO code still
>> > > >                                                holds page reference
>> > > > ...
>> > > >
>> > > > eventually read completes
>> > > >   dio_bio_complete(bio)
>> > > >     set_page_dirty_lock(page)
>> > > >       Bummer, we've just marked the page as dirty without having buffers.
>> > > >       Eventually writeback will find it and filesystem will complain...
>> > > >
>> > > > Am I missing something?
>> > > >
>> > > > The problem here is that filesystems fundamentally assume that a page can
>> > > > be written to only between ->write_begin - ->write_end (in this interval
>> > > > the page is locked), or between ->page_mkwrite - ->writepage and above is
>> > > > an example where this does not hold because when a page reference is
>> > > > acquired through get_user_pages(), page can get written to by the holder of
>> > > > the reference and dirtied even after it has been unmapped from page tables
>> > > > and ->writepage has been called. This is not only a cosmetic issue leading
>> > > > to assertion failure but it can also lead to data loss, data corruption, or
>> > > > other unpleasant surprises as filesystems assume page contents cannot be
>> > > > modified until either ->write_begin() or ->page_mkwrite gets called and
>> > > > those calls are serialized by proper locking with problematic operations
>> > > > such as hole punching etc.
>> > > >
>> > > > I'm not sure how to fix this problem. We could 'simulate' a writeable page
>> > > > fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
>> > > > virtual address of the fault, don't hold mmap_sem, etc., possibly
>> > > > expensive, but it would make filesystems happy. Data stored by GUP user
>> > > > (e.g. read by DIO in the above case) could still get lost if someone e.g.
>> > > > punched hole under the buffer or otherwise messed with the underlying
>> > > > storage of the page while DIO was running but arguably users could expect
>> > > > such outcome.
>> > > >
>> > > > Another possible solution would be to make sure page is writeably mapped
>> > > > until GUP user drops its reference. That would be arguably cleaner but
>> > > > probably that would mean we have to track number of writeable GUP page
>> > > > references separately (no space space in struct page is a problem here) and
>> > > > block page_mkclean() until they are dropped. Also for long term GUP users
>> > > > like Infiniband or V4L we'd have to come up with some solution as we should
>> > > > not block page_mkclean() for so long.
>> > >
>> > > Do we need to block page_mkclean, or could we defer buffer reclaiming
>> > > to the last put of the page?
>> >
>> > As I wrote to Dave the problem is no so much with reclaiming of buffers but
>> > with the fact filesystems don't expect page can be dirtied after
>> > page_mkclean() is finished.
>> >
>> > > I think once we have the "register memory with lease" mechanism for
>> > > Infiniband we could expand it to the page cache case. The problem is
>> > > the regression this would cause with userspace that expects it can
>> > > maintain file backed memory registrations indefinitely.
>> > >
>> > > What are the implications of holding off page_mkclean or release
>> > > buffers indefinitely?
>> >
>> > Bad. You cannot write the page to disk until page_mkclean() finishes as
>> > page_mkclean() is part of clear_page_dirty_for_io(). And we really do need
>> > that functionality there e.g. to make sure tail of the last page in the
>> > file is properly zeroed out, storage with DIF/DIX can compute checksum of
>> > the data safely before submitting it to the device etc.
>> >
>> > > Is an indefinite / interruptible sleep waiting for the 'put' event of
>> > > a get_user_pages() page unacceptable? The current case that the file
>> > > contents will not be coherent with respect to in-flight RDMA, perhaps
>> > > waiting for that to complete is better than cleaning buffers from the
>> > > page prematurely.
>> >
>> > Yeah, indefinite sleep is really a no-go.
>> >
>> > > > As a side note DAX needs some solution for GUP users as well. The problems
>> > > > are similar there in nature, just much easier to hit. So at least a
>> > > > solution for long-term GUP users can (and I strongly believe should) be
>> > > > shared between standard and DAX paths.
>> > >
>> > > In the DAX case we rely on the fact that when the page goes idle we
>> > > only need to worry about the filesytem block map changing, the page
>> > > won't get reallocated somewhere else. We can't use page idle as an
>> > > event in this case, however, if the page reference count is one then
>> > > the DIO code can know that it has the page exclusively, so maybe DAX
>> > > and non-DAX can share the page count == 1 event notification.
>> >
>> > The races I describe do not need to involve truncate / hole punching. It is
>> > just enough to race with page writeback. So page references are of no use
>> > here. We would have to specifically track number of references acquired by
>> > GUP or something like that. So what I wanted to share with DAX is the
>> > long-term pin handling, the rest is unclear for now.
>> >
>> >                                                                 Honza
>> > --
>> > Jan Kara <jack@suse.com>
>> > SUSE Labs, CR
>> >
>> > --
>> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> > the body to majordomo@kvack.org.  For more info on Linux MM,
>> > see: http://www.linux-mm.org/ .
>> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>
>> The bug can be reliably reproduced in our platform with the
>> current upstream kernel(80aa76bcd364 Merge tag 'xfs-4.17-merge-4' of
>> git://git.kernel.org/pub/scm/fs/xfs/xfs-linux). I'm happy to
>> help to test and debug. The error message is as following:
>
> Thanks for report! So what workload do you run to trigger this? Is it just
> a direct IO read to a buffer in a shared file mapping or something else?
>
>                                                                 Honza
>
>> kernel BUG at /home/gavin/work-kernel/fs/ext4/inode.c:2126!
>> invalid opcode: 0000 [#1] SMP PTI
>> Modules linked in: veth ipt_MASQUERADE nf_nat_masquerade_ipv4
>> nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_conntrack_ipv4
>> nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat
>> nf_conntrack br_netfilter bridge stp llc overlay xt_multiport
>> iptable_filter ip_tables x_tables cachefiles fscache esp6_offload esp6
>> esp4_offload esp4 xfrm_algo nls_iso8859_1 intel_rapl sb_edac
>> x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel
>> nvidia_uvm(POE) mxm_wmi kvm joydev input_leds irqbypass intel_cstate
>> intel_rapl_perf mei_me ipmi_si shpchp lpc_ich mei acpi_power_meter
>> mac_hid wmi ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
>> libiscsi scsi_transport_iscsi ipmi_devintf sunrpc ipmi_msghandler
>> autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov
>> async_memcpy
>>  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
>> multipath linear i2c_algo_bit nvidia_drm(POE) ses crct10dif_pclmul
>> crc32_pclmul nvidia_modeset(POE) ttm ghash_clmulni_intel enclosure
>> hid_generic uas pcbc scsi_transport_sas drm_kms_helper usbhid
>> aesni_intel hid aes_x86_64 usb_storage syscopyarea crypto_simd
>> sysfillrect mlx5_core cryptd nvidia(POE) sysimgblt glue_helper ixgbe
>> mlxfw megaraid_sas fb_sys_fops devlink dca ahci ptp drm libahci
>> pps_core mdio
>> CPU: 54 PID: 8938 Comm: kworker/u161:0 Tainted: P           OE
>> 4.16.0-999-generic #201804102200
>>
>> Workqueue: writeback wb_workfn (flush-8:0)
>> RIP: 0010:ext4_writepage+0x318/0x770
>> RSP: 0018:ffffb514e76cb7f8 EFLAGS: 00010246
>> RAX: 00500b4e8000026d RBX: 0000000000001000 RCX: ffff8c2cafba5000
>> RDX: ffff8bec4069a020 RSI: ffffb514e76cbc28 RDI: ffffe91efcc8bd80
>> RBP: ffffb514e76cb870 R08: 0000000000028115 R09: 00000000000280c0
>> R10: 0000000000000002 R11: ffff8c2dbffd4000 R12: ffffe91efcc8bd80
>> R13: ffff8bec40699ea8 R14: ffffb514e76cbc28 R15: ffffe91efcc8bd80
>> FS:  0000000000000000(0000) GS:ffff8becbfc80000(0000)
>> knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007ffcc8f6f080 CR3: 0000004b9ce0a006 CR4: 00000000003606e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>  ? rmap_walk+0x41/0x60
>>  ? page_mkclean+0x9f/0xb0
>>  ? invalid_page_referenced_vma+0x80/0x80
>>  __writepage+0x17/0x50
>>  write_cache_pages+0x228/0x4a0
>>  ? __wb_calc_thresh+0x140/0x140
>>  generic_writepages+0x61/0xa0
>>  ? _cond_resched+0x1a/0x50
>>  ? write_cache_pages+0x396/0x4a0
>>  ext4_writepages+0x1fc/0xe00
>>  ? ext4_writepages+0x1fc/0xe00
>>  ? generic_writepages+0x6d/0xa0
>>  ? fprop_fraction_percpu+0x2f/0x80
>>  do_writepages+0x1c/0x60
>>  ? do_writepages+0x1c/0x60
>>  __writeback_single_inode+0x45/0x320
>>  writeback_sb_inodes+0x266/0x580
>>  __writeback_inodes_wb+0x92/0xc0
>>  wb_writeback+0x282/0x310
>>  wb_workfn+0x1a3/0x440
>>  ? wb_workfn+0x1a3/0x440
>>  process_one_work+0x1db/0x3c0
>>  worker_thread+0x4b/0x420
>>  kthread+0x102/0x140
>>  ? rescuer_thread+0x380/0x380
>>  ? kthread_create_worker_on_cpu+0x70/0x70
>>  ret_from_fork+0x35/0x40
>> Code: ff f6 c4 08 0f 85 58 ff ff ff e8 68 56 00 00 ba 00 10 00 00 31
>> f6 41 bd fb ff ff ff e8 a2 9e ff ff 4c 89 e7 e8 fa 39 ea ff eb 8a <0f>
>> 0b c6 45 a8 00 e9 f0 fd ff ff 49 83 7c 24 10 00 0f 85 c7 03
>> RIP: ext4_writepage+0x318/0x770 RSP: ffffb514e76cb7f8
>> ---[ end trace 59d4e1a4b221404b ]---
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

##############################
## The reproducing scenario ##
##############################

The reproducer is creating eight processes by the mpirun and each
process is bound to the underlying GPU computation unit by the
cudaSetDevice() interface. The scenario is that each process creates a
8K bytes buffer by the cudaMalloc() and sends/receives the 8K bytes
buffer in a circular buffer behavior by the MPI_Isend()/MPI_Irecv()
like:

P0 -> P1 -> P2 -> P3 -> P4 -> P5 -> P6 -> P7-> to P0

The sending/receiving buffer is iterated 1000 times, then to proceed
to clean up the resource allocated to the cuda device by:

cudaDeviceReset();
MPI_Finalize();

However, the sequence is actually controversial as the
cudaDeviceReset() already cleaned up the resource, including the 8K
bytes memory allocated by the cudaMalloc(), MPI_Finalize() could
access the memory allocated by cudaMalloc().  Finally, the
MPI_Finalize() accidentally triggers the kernel bug.

I tried to look up the implementation of the MPI library,
unfortunately, the exact mechanism corresponded to the
try_to_free_buffers()/set_page_dirty() is still unknown.

##############################
##### The kernel testing #####
##############################

1). I tried the patch which tries to capture the dirty page scenario when
the page is released, however, there is no warning message when the
bug is successfully reproduced. So, it eliminates the possibility that
the ext4_releasepage() removes the buffer of a dirty page.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3d4d1dccc8a1..9310277c8e3e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3384,6 +3384,11 @@ static int ext4_releasepage(struct page *page,
gfp_t wait)

        trace_ext4_releasepage(page);

+       if (PageDirty(page)) {
+               WARN_ON(1);
+               return 0;
+       }
+
        /* Page has dirty journalled data -> cannot release */
        if (PageChecked(page))
                return 0;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 930aa0d19761..3bd1a605853f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1134,6 +1134,11 @@ static int bdev_try_to_free_page(struct
super_block *sb, struct page *page,
 {
        journal_t *journal = EXT4_SB(sb)->s_journal;

+       if (PageDirty(page)) {
+               WARN_ON(1);
+               return 0;
+       }
+
        WARN_ON(PageChecked(page));
        if (!page_has_buffers(page))
                return 0;


2). Then, I proceeded to try the patch mentioned in link[1] and found
the patch was already merged into the kernel v4.10-rc1 with commit id:

6dcc693bc57f ext4: warn when page is dirtied without buffers

According to the code:

static int ext4_set_page_dirty(struct page *page)
{
       WARN_ON_ONCE(!PageLocked(page) && !PageDirty(page));
       WARN_ON_ONCE(!page_has_buffers(page));
       return __set_page_dirty_buffers(page);
}

I also cannot find the warning message[2] with the
"WARN_ON_ONCE(!page_has_buffers(page))" when reproduced with the
latest upstream kernel.

With the patch in the current kernel, I think the first scenario,
CPU1 is doing direct IO and CPU2 is doing the reclaim, also can be
captured in the ext4_set_page_dirty(). And there is no error message
related to the warning, so, it seems not related to this case. Am I
missing anything?

> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
set_page_dirty
  int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
    return (*spd)(page);
      ext4_set_page_dirty
        WARN_ON_ONCE(!page_has_buffers(page));
>       Bummer, we've just marked the page as dirty without having buffers.
>       Eventually writeback will find it and filesystem will complain...

There could be other path triggering the bug. Any idea where to add
the debug message?

##############################
######### Reference ##########
##############################

[1]. kernel BUG at fs/ext4/inode.c:2428!
https://patchwork.ozlabs.org/patch/697715/

[2]. coredump message of latest upstream kernel
http://paste.ubuntu.com/p/FfDk9cChdX/

^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-04-16  9:20 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-03 10:04 Filesystem crashes due to pages without buffers Jan Kara
2018-01-03 10:04 ` Jan Kara
2018-01-04  4:56 ` Dan Williams
2018-01-04  4:56   ` Dan Williams
2018-01-04 11:33   ` Jan Kara
2018-01-04 11:33     ` Jan Kara
2018-04-13 12:39     ` Gavin Guo
2018-04-13 12:58       ` Jan Kara
2018-04-16  9:20         ` Gavin Guo
2018-01-04  5:59 ` Dave Chinner
2018-01-04  5:59   ` Dave Chinner
2018-01-04  8:52   ` Jan Kara
2018-01-04  8:52     ` Jan Kara
2018-01-04 10:08     ` Dave Chinner
2018-01-04 10:08       ` Dave Chinner
2018-01-04  6:10 ` Leon Romanovsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.