hole-punch vs fault

* hole-punch vs fault
@ 2013-11-27 13:48 Matthew Wilcox
  2013-11-27 22:19 ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2013-11-27 13:48 UTC (permalink / raw)
  To: linux-fsdevel

I'm trying to figure out what protects us in this scenario:

Thread A mmaps a file and reads from it.
Thread B punches holes in that file.

It seems to me that thread A can end up with references to pages that are
in the page cache that point to deallocated blocks (which the filesystem
might subsequently allocate to a different file).

Thread B calls ext4_punch_hole()
Thread B takes ->i_mutex
Thread B calls truncate_pagecache_range()
Thread B calls unmap_mapping_range() which removes the PTE from Thread A's
  page tables.
Thread B calls truncate_inode_pages_range() which removes the pages from the
  page cache.
Thread A calls filemap_fault().  It calls find_get_page(), but it's not there.
  It calls do_sync_mmap_readahead() then find_get_page() again, still not
  there.  So it calls page_cache_read() which calls ext4_readpage which calls
  mpage_readpage(), which calls ext4_get_block() which finds the blocks
  in question, nowhere taking the i_mutex.
Now Thread B gets back from its summer holiday, takes i_data_sem and calls
  ext4_es_remove_extent() and ext4_ext_remove_space() /
  ext4_free_hole_blocks() before releasing i_data_sem and i_mutex.
Thread A happily dirties the page it has.
Thread C extends a file, which gets the victim block allocated.
vmscan causes the page to be written back, since it's dirty and on the LRU lists
Bang, thread C has a corrupted file.

This doesn't happen with the normal truncate path since i_size is written
before calling unmap_mapping_range, and i_size is checked after filemap_fault
gets the page lock.

How should we solve this?  I see two realistic options.  One is an rwsem
in the inode or address_space taken for read by the fault path and for
write by the holepunch path.  It's kind of icky because each filesystem
will need to add support for it.

The second is to put some kind of generation counter in the inode or
address_space that is incremented on every deallocation.  The fault path
would read the generation counter before calling find_get_page() and then
check it hasn't changed after getting the page lock (goto retry_find).  I
like that one a little more since we can fix it all in common code, and I
think it'll be lower overhead in the fault path.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread