* [PATCH 0/3] Avoid dirtying pages during mlock @ 2010-11-17 12:23 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 12:23 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal As discussed in linux-mm, mlocking a shared, writable vma currently causes the corresponding pages to be marked as dirty and queued for writeback. This seems rather unnecessary given that the pages are not being actually modified during mlock. It is understood that for non-shared mappings (file or anon) we want to use a write fault in order to break COW, but there is just no such need for shared mappings. The first two patches in this series do not introduce any behavior change. The intent there is to make it obvious that dirtying file pages is only done in the (writable, shared) case. I think this clarifies the code, but I wouldn't mind dropping these two patches if there is no consensus about them. The last patch is where we actually avoid dirtying shared mappings during mlock. Note that as a side effect of this, we won't call page_mkwrite() for the mappings that define it, and won't be pre-allocating data blocks at the FS level if the mapped file was sparsely allocated. My understanding is that mlock does not need to provide such guarantee, as evidenced by the fact that it never did for the filesystems that don't define page_mkwrite() - including some common ones like ext3. However, I would like to gather feedback on this from filesystem people as a precaution. If this turns out to be a showstopper, maybe block preallocation can be added back on using a different interface. Large shared mlocks are getting significantly (>2x) faster in my tests, as the disk can be fully used for reading the file instead of having to share between this and writeback. Michel Lespinasse (3): do_wp_page: remove the 'reuse' flag do_wp_page: clarify dirty_page handling mlock: avoid dirtying pages and triggering writeback mm/memory.c | 90 ++++++++++++++++++++++++++++++++--------------------------- mm/mlock.c | 7 ++++- 2 files changed, 55 insertions(+), 42 deletions(-) -- 1.7.3.1 ^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 0/3] Avoid dirtying pages during mlock @ 2010-11-17 12:23 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 12:23 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal As discussed in linux-mm, mlocking a shared, writable vma currently causes the corresponding pages to be marked as dirty and queued for writeback. This seems rather unnecessary given that the pages are not being actually modified during mlock. It is understood that for non-shared mappings (file or anon) we want to use a write fault in order to break COW, but there is just no such need for shared mappings. The first two patches in this series do not introduce any behavior change. The intent there is to make it obvious that dirtying file pages is only done in the (writable, shared) case. I think this clarifies the code, but I wouldn't mind dropping these two patches if there is no consensus about them. The last patch is where we actually avoid dirtying shared mappings during mlock. Note that as a side effect of this, we won't call page_mkwrite() for the mappings that define it, and won't be pre-allocating data blocks at the FS level if the mapped file was sparsely allocated. My understanding is that mlock does not need to provide such guarantee, as evidenced by the fact that it never did for the filesystems that don't define page_mkwrite() - including some common ones like ext3. However, I would like to gather feedback on this from filesystem people as a precaution. If this turns out to be a showstopper, maybe block preallocation can be added back on using a different interface. Large shared mlocks are getting significantly (>2x) faster in my tests, as the disk can be fully used for reading the file instead of having to share between this and writeback. Michel Lespinasse (3): do_wp_page: remove the 'reuse' flag do_wp_page: clarify dirty_page handling mlock: avoid dirtying pages and triggering writeback mm/memory.c | 90 ++++++++++++++++++++++++++++++++--------------------------- mm/mlock.c | 7 ++++- 2 files changed, 55 insertions(+), 42 deletions(-) -- 1.7.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 1/3] do_wp_page: remove the 'reuse' flag 2010-11-17 12:23 ` Michel Lespinasse @ 2010-11-17 12:23 ` Michel Lespinasse -1 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 12:23 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal Reorganize the code to remove the 'reuse' flag. No behavior changes. Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/memory.c | 11 +++++------ 1 files changed, 5 insertions(+), 6 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 0e18b4d..810a75f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2110,7 +2110,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, { struct page *old_page, *new_page; pte_t entry; - int reuse = 0, ret = 0; + int ret = 0; int page_mkwrite = 0; struct page *dirty_page = NULL; @@ -2147,14 +2147,16 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, } page_cache_release(old_page); } - reuse = reuse_swap_page(old_page); - if (reuse) + if (reuse_swap_page(old_page)) { /* * The page is all ours. Move it to our anon_vma so * the rmap code will not search our parent or siblings. * Protected against the rmap code by the page lock. */ page_move_anon_rmap(old_page, vma, address); + unlock_page(old_page); + goto reuse; + } unlock_page(old_page); } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { @@ -2218,10 +2220,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, } dirty_page = old_page; get_page(dirty_page); - reuse = 1; - } - if (reuse) { reuse: flush_cache_page(vma, address, pte_pfn(orig_pte)); entry = pte_mkyoung(orig_pte); -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 1/3] do_wp_page: remove the 'reuse' flag @ 2010-11-17 12:23 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 12:23 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal Reorganize the code to remove the 'reuse' flag. No behavior changes. Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/memory.c | 11 +++++------ 1 files changed, 5 insertions(+), 6 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 0e18b4d..810a75f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2110,7 +2110,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, { struct page *old_page, *new_page; pte_t entry; - int reuse = 0, ret = 0; + int ret = 0; int page_mkwrite = 0; struct page *dirty_page = NULL; @@ -2147,14 +2147,16 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, } page_cache_release(old_page); } - reuse = reuse_swap_page(old_page); - if (reuse) + if (reuse_swap_page(old_page)) { /* * The page is all ours. Move it to our anon_vma so * the rmap code will not search our parent or siblings. * Protected against the rmap code by the page lock. */ page_move_anon_rmap(old_page, vma, address); + unlock_page(old_page); + goto reuse; + } unlock_page(old_page); } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { @@ -2218,10 +2220,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, } dirty_page = old_page; get_page(dirty_page); - reuse = 1; - } - if (reuse) { reuse: flush_cache_page(vma, address, pte_pfn(orig_pte)); entry = pte_mkyoung(orig_pte); -- 1.7.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 2/3] do_wp_page: clarify dirty_page handling 2010-11-17 12:23 ` Michel Lespinasse @ 2010-11-17 12:23 ` Michel Lespinasse -1 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 12:23 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal Reorganize the code so that dirty pages are handled closer to the place that makes them dirty (handling write fault into shared, writable VMAs). No behavior changes. Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/memory.c | 72 +++++++++++++++++++++++++++++++--------------------------- 1 files changed, 38 insertions(+), 34 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 810a75f..d4c0c2e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2227,8 +2227,45 @@ reuse: entry = maybe_mkwrite(pte_mkdirty(entry), vma); if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, page_table); + pte_unmap_unlock(page_table, ptl); ret |= VM_FAULT_WRITE; - goto unlock; + + if (!dirty_page) + return ret; + + /* + * Yes, Virginia, this is actually required to prevent a race + * with clear_page_dirty_for_io() from clearing the page dirty + * bit after it clear all dirty ptes, but before a racing + * do_wp_page installs a dirty pte. + * + * do_no_page is protected similarly. + */ + if (!page_mkwrite) { + wait_on_page_locked(dirty_page); + set_page_dirty_balance(dirty_page, page_mkwrite); + } + put_page(dirty_page); + if (page_mkwrite) { + struct address_space *mapping = dirty_page->mapping; + + set_page_dirty(dirty_page); + unlock_page(dirty_page); + page_cache_release(dirty_page); + if (mapping) { + /* + * Some device drivers do not set page.mapping + * but still dirty their pages + */ + balance_dirty_pages_ratelimited(mapping); + } + } + + /* file_update_time outside page_lock */ + if (vma->vm_file) + file_update_time(vma->vm_file); + + return ret; } /* @@ -2334,39 +2371,6 @@ gotten: page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); - if (dirty_page) { - /* - * Yes, Virginia, this is actually required to prevent a race - * with clear_page_dirty_for_io() from clearing the page dirty - * bit after it clear all dirty ptes, but before a racing - * do_wp_page installs a dirty pte. - * - * do_no_page is protected similarly. - */ - if (!page_mkwrite) { - wait_on_page_locked(dirty_page); - set_page_dirty_balance(dirty_page, page_mkwrite); - } - put_page(dirty_page); - if (page_mkwrite) { - struct address_space *mapping = dirty_page->mapping; - - set_page_dirty(dirty_page); - unlock_page(dirty_page); - page_cache_release(dirty_page); - if (mapping) { - /* - * Some device drivers do not set page.mapping - * but still dirty their pages - */ - balance_dirty_pages_ratelimited(mapping); - } - } - - /* file_update_time outside page_lock */ - if (vma->vm_file) - file_update_time(vma->vm_file); - } return ret; oom_free_new: page_cache_release(new_page); -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 2/3] do_wp_page: clarify dirty_page handling @ 2010-11-17 12:23 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 12:23 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal Reorganize the code so that dirty pages are handled closer to the place that makes them dirty (handling write fault into shared, writable VMAs). No behavior changes. Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/memory.c | 72 +++++++++++++++++++++++++++++++--------------------------- 1 files changed, 38 insertions(+), 34 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 810a75f..d4c0c2e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2227,8 +2227,45 @@ reuse: entry = maybe_mkwrite(pte_mkdirty(entry), vma); if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, page_table); + pte_unmap_unlock(page_table, ptl); ret |= VM_FAULT_WRITE; - goto unlock; + + if (!dirty_page) + return ret; + + /* + * Yes, Virginia, this is actually required to prevent a race + * with clear_page_dirty_for_io() from clearing the page dirty + * bit after it clear all dirty ptes, but before a racing + * do_wp_page installs a dirty pte. + * + * do_no_page is protected similarly. + */ + if (!page_mkwrite) { + wait_on_page_locked(dirty_page); + set_page_dirty_balance(dirty_page, page_mkwrite); + } + put_page(dirty_page); + if (page_mkwrite) { + struct address_space *mapping = dirty_page->mapping; + + set_page_dirty(dirty_page); + unlock_page(dirty_page); + page_cache_release(dirty_page); + if (mapping) { + /* + * Some device drivers do not set page.mapping + * but still dirty their pages + */ + balance_dirty_pages_ratelimited(mapping); + } + } + + /* file_update_time outside page_lock */ + if (vma->vm_file) + file_update_time(vma->vm_file); + + return ret; } /* @@ -2334,39 +2371,6 @@ gotten: page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); - if (dirty_page) { - /* - * Yes, Virginia, this is actually required to prevent a race - * with clear_page_dirty_for_io() from clearing the page dirty - * bit after it clear all dirty ptes, but before a racing - * do_wp_page installs a dirty pte. - * - * do_no_page is protected similarly. - */ - if (!page_mkwrite) { - wait_on_page_locked(dirty_page); - set_page_dirty_balance(dirty_page, page_mkwrite); - } - put_page(dirty_page); - if (page_mkwrite) { - struct address_space *mapping = dirty_page->mapping; - - set_page_dirty(dirty_page); - unlock_page(dirty_page); - page_cache_release(dirty_page); - if (mapping) { - /* - * Some device drivers do not set page.mapping - * but still dirty their pages - */ - balance_dirty_pages_ratelimited(mapping); - } - } - - /* file_update_time outside page_lock */ - if (vma->vm_file) - file_update_time(vma->vm_file); - } return ret; oom_free_new: page_cache_release(new_page); -- 1.7.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 12:23 ` Michel Lespinasse @ 2010-11-17 12:23 ` Michel Lespinasse -1 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 12:23 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal When faulting in pages for mlock(), we want to break COW for anonymous or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no need to write-fault into VM_SHARED vmas since shared file pages can be mlocked first and dirtied later, when/if they actually get written to. Skipping the write fault is desirable, as we don't want to unnecessarily cause these pages to be dirtied and queued for writeback. Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/memory.c | 7 ++++++- mm/mlock.c | 7 ++++++- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index d4c0c2e..7f45085 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3291,7 +3291,12 @@ int make_pages_present(unsigned long addr, unsigned long end) vma = find_vma(current->mm, addr); if (!vma) return -ENOMEM; - write = (vma->vm_flags & VM_WRITE) != 0; + /* + * We want to touch writable mappings with a write fault in order + * to break COW, except for shared mappings because these don't COW + * and we would not want to dirty them for nothing. + */ + write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE; BUG_ON(addr >= end); BUG_ON(end > vma->vm_end); len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE; diff --git a/mm/mlock.c b/mm/mlock.c index b70919c..4f31864 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -171,7 +171,12 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); gup_flags = FOLL_TOUCH | FOLL_GET; - if (vma->vm_flags & VM_WRITE) + /* + * We want to touch writable mappings with a write fault in order + * to break COW, except for shared mappings because these don't COW + * and we would not want to dirty them for nothing. + */ + if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) gup_flags |= FOLL_WRITE; /* We don't try to access the guard page of a stack vma */ -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-17 12:23 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 12:23 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal When faulting in pages for mlock(), we want to break COW for anonymous or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no need to write-fault into VM_SHARED vmas since shared file pages can be mlocked first and dirtied later, when/if they actually get written to. Skipping the write fault is desirable, as we don't want to unnecessarily cause these pages to be dirtied and queued for writeback. Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/memory.c | 7 ++++++- mm/mlock.c | 7 ++++++- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index d4c0c2e..7f45085 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3291,7 +3291,12 @@ int make_pages_present(unsigned long addr, unsigned long end) vma = find_vma(current->mm, addr); if (!vma) return -ENOMEM; - write = (vma->vm_flags & VM_WRITE) != 0; + /* + * We want to touch writable mappings with a write fault in order + * to break COW, except for shared mappings because these don't COW + * and we would not want to dirty them for nothing. + */ + write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE; BUG_ON(addr >= end); BUG_ON(end > vma->vm_end); len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE; diff --git a/mm/mlock.c b/mm/mlock.c index b70919c..4f31864 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -171,7 +171,12 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); gup_flags = FOLL_TOUCH | FOLL_GET; - if (vma->vm_flags & VM_WRITE) + /* + * We want to touch writable mappings with a write fault in order + * to break COW, except for shared mappings because these don't COW + * and we would not want to dirty them for nothing. + */ + if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) gup_flags |= FOLL_WRITE; /* We don't try to access the guard page of a stack vma */ -- 1.7.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 12:23 ` Michel Lespinasse @ 2010-11-17 12:57 ` Nick Piggin -1 siblings, 0 replies; 56+ messages in thread From: Nick Piggin @ 2010-11-17 12:57 UTC (permalink / raw) To: Michel Lespinasse Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: > When faulting in pages for mlock(), we want to break COW for anonymous > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is > no need to write-fault into VM_SHARED vmas since shared file pages can > be mlocked first and dirtied later, when/if they actually get written to. > Skipping the write fault is desirable, as we don't want to unnecessarily > cause these pages to be dirtied and queued for writeback. It's not just to break COW, but to do block allocation and such (filesystem's page_mkwrite op). That needs to at least be explained in the changelog. Filesystem doesn't have a good way to fully pin required things according to mlock, but page_mkwrite provides some reasonable things (like block allocation / reservation). ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-17 12:57 ` Nick Piggin 0 siblings, 0 replies; 56+ messages in thread From: Nick Piggin @ 2010-11-17 12:57 UTC (permalink / raw) To: Michel Lespinasse Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Peter Zijlstra, Nick Piggin, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: > When faulting in pages for mlock(), we want to break COW for anonymous > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is > no need to write-fault into VM_SHARED vmas since shared file pages can > be mlocked first and dirtied later, when/if they actually get written to. > Skipping the write fault is desirable, as we don't want to unnecessarily > cause these pages to be dirtied and queued for writeback. It's not just to break COW, but to do block allocation and such (filesystem's page_mkwrite op). That needs to at least be explained in the changelog. Filesystem doesn't have a good way to fully pin required things according to mlock, but page_mkwrite provides some reasonable things (like block allocation / reservation). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 12:57 ` Nick Piggin @ 2010-11-17 15:28 ` Peter Zijlstra -1 siblings, 0 replies; 56+ messages in thread From: Peter Zijlstra @ 2010-11-17 15:28 UTC (permalink / raw) To: Nick Piggin Cc: Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote: > On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: > > When faulting in pages for mlock(), we want to break COW for anonymous > > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is > > no need to write-fault into VM_SHARED vmas since shared file pages can > > be mlocked first and dirtied later, when/if they actually get written to. > > Skipping the write fault is desirable, as we don't want to unnecessarily > > cause these pages to be dirtied and queued for writeback. > > It's not just to break COW, but to do block allocation and such > (filesystem's page_mkwrite op). That needs to at least be explained > in the changelog. Agreed, the 0/3 description actually does mention this. > Filesystem doesn't have a good way to fully pin required things > according to mlock, but page_mkwrite provides some reasonable things > (like block allocation / reservation). Right, but marking all pages dirty isn't really sane. I can imagine making the reservation but not marking things dirty solution, although it might be lots harder to implement, esp since some filesystems don't actually have a page_mkwrite() implementation. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-17 15:28 ` Peter Zijlstra 0 siblings, 0 replies; 56+ messages in thread From: Peter Zijlstra @ 2010-11-17 15:28 UTC (permalink / raw) To: Nick Piggin Cc: Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote: > On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: > > When faulting in pages for mlock(), we want to break COW for anonymous > > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is > > no need to write-fault into VM_SHARED vmas since shared file pages can > > be mlocked first and dirtied later, when/if they actually get written to. > > Skipping the write fault is desirable, as we don't want to unnecessarily > > cause these pages to be dirtied and queued for writeback. > > It's not just to break COW, but to do block allocation and such > (filesystem's page_mkwrite op). That needs to at least be explained > in the changelog. Agreed, the 0/3 description actually does mention this. > Filesystem doesn't have a good way to fully pin required things > according to mlock, but page_mkwrite provides some reasonable things > (like block allocation / reservation). Right, but marking all pages dirty isn't really sane. I can imagine making the reservation but not marking things dirty solution, although it might be lots harder to implement, esp since some filesystems don't actually have a page_mkwrite() implementation. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 15:28 ` Peter Zijlstra @ 2010-11-17 22:05 ` Michel Lespinasse -1 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 22:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 7:28 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote: >> On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: >> > When faulting in pages for mlock(), we want to break COW for anonymous >> > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is >> > no need to write-fault into VM_SHARED vmas since shared file pages can >> > be mlocked first and dirtied later, when/if they actually get written to. >> > Skipping the write fault is desirable, as we don't want to unnecessarily >> > cause these pages to be dirtied and queued for writeback. >> >> It's not just to break COW, but to do block allocation and such >> (filesystem's page_mkwrite op). That needs to at least be explained >> in the changelog. > > Agreed, the 0/3 description actually does mention this. > >> Filesystem doesn't have a good way to fully pin required things >> according to mlock, but page_mkwrite provides some reasonable things >> (like block allocation / reservation). > > Right, but marking all pages dirty isn't really sane. I can imagine > making the reservation but not marking things dirty solution, although > it might be lots harder to implement, esp since some filesystems don't > actually have a page_mkwrite() implementation. Really, my understanding is that not pre-allocating filesystem blocks is just fine. This is, after all, what happens with ext3 and it's never been reported as a bug (that I know of). If filesystem people's feedback is that they really want mlock() to continue pre-allocating blocks, maybe we can just do it using fallocate() rather than page_mkwrite() callbacks ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-17 22:05 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 22:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 7:28 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote: >> On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: >> > When faulting in pages for mlock(), we want to break COW for anonymous >> > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is >> > no need to write-fault into VM_SHARED vmas since shared file pages can >> > be mlocked first and dirtied later, when/if they actually get written to. >> > Skipping the write fault is desirable, as we don't want to unnecessarily >> > cause these pages to be dirtied and queued for writeback. >> >> It's not just to break COW, but to do block allocation and such >> (filesystem's page_mkwrite op). That needs to at least be explained >> in the changelog. > > Agreed, the 0/3 description actually does mention this. > >> Filesystem doesn't have a good way to fully pin required things >> according to mlock, but page_mkwrite provides some reasonable things >> (like block allocation / reservation). > > Right, but marking all pages dirty isn't really sane. I can imagine > making the reservation but not marking things dirty solution, although > it might be lots harder to implement, esp since some filesystems don't > actually have a page_mkwrite() implementation. Really, my understanding is that not pre-allocating filesystem blocks is just fine. This is, after all, what happens with ext3 and it's never been reported as a bug (that I know of). If filesystem people's feedback is that they really want mlock() to continue pre-allocating blocks, maybe we can just do it using fallocate() rather than page_mkwrite() callbacks ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 22:05 ` Michel Lespinasse @ 2010-11-17 22:18 ` Peter Zijlstra -1 siblings, 0 replies; 56+ messages in thread From: Peter Zijlstra @ 2010-11-17 22:18 UTC (permalink / raw) To: Michel Lespinasse Cc: Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, 2010-11-17 at 14:05 -0800, Michel Lespinasse wrote: > > Really, my understanding is that not pre-allocating filesystem blocks > is just fine. This is, after all, what happens with ext3 and it's > never been reported as a bug (that I know of). > fwiw I'm perfectly fine with it > If filesystem people's feedback is that they really want mlock() to > continue pre-allocating blocks, maybe we can just do it using > fallocate() rather than page_mkwrite() callbacks ? Sounds sensible.. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-17 22:18 ` Peter Zijlstra 0 siblings, 0 replies; 56+ messages in thread From: Peter Zijlstra @ 2010-11-17 22:18 UTC (permalink / raw) To: Michel Lespinasse Cc: Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, 2010-11-17 at 14:05 -0800, Michel Lespinasse wrote: > > Really, my understanding is that not pre-allocating filesystem blocks > is just fine. This is, after all, what happens with ext3 and it's > never been reported as a bug (that I know of). > fwiw I'm perfectly fine with it > If filesystem people's feedback is that they really want mlock() to > continue pre-allocating blocks, maybe we can just do it using > fallocate() rather than page_mkwrite() callbacks ? Sounds sensible.. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 22:05 ` Michel Lespinasse @ 2010-11-17 23:11 ` Dave Chinner -1 siblings, 0 replies; 56+ messages in thread From: Dave Chinner @ 2010-11-17 23:11 UTC (permalink / raw) To: Michel Lespinasse Cc: Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 02:05:30PM -0800, Michel Lespinasse wrote: > On Wed, Nov 17, 2010 at 7:28 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote: > >> On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: > >> > When faulting in pages for mlock(), we want to break COW for anonymous > >> > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is > >> > no need to write-fault into VM_SHARED vmas since shared file pages can > >> > be mlocked first and dirtied later, when/if they actually get written to. > >> > Skipping the write fault is desirable, as we don't want to unnecessarily > >> > cause these pages to be dirtied and queued for writeback. > >> > >> It's not just to break COW, but to do block allocation and such > >> (filesystem's page_mkwrite op). That needs to at least be explained > >> in the changelog. > > > > Agreed, the 0/3 description actually does mention this. > > > >> Filesystem doesn't have a good way to fully pin required things > >> according to mlock, but page_mkwrite provides some reasonable things > >> (like block allocation / reservation). > > > > Right, but marking all pages dirty isn't really sane. I can imagine > > making the reservation but not marking things dirty solution, although > > it might be lots harder to implement, esp since some filesystems don't > > actually have a page_mkwrite() implementation. > > Really, my understanding is that not pre-allocating filesystem blocks > is just fine. This is, after all, what happens with ext3 and it's > never been reported as a bug (that I know of). It's not ext3 you have to worry about - it's the filesystems that need special state set up on their pages/buffers for ->writepage to work correctly that are the problem. You need to call ->write_begin/->write_end to get the state set up properly. If this state is not set up properly, silent data loss will occur during mmap writes either by ENOSPC or failing to set up writes into unwritten extents correctly (i.e. we'll be back to where we were in 2.6.15). I don't think ->page_mkwrite can be worked around - we need that to be called on the first write fault of any mmap()d page to ensure it is set up correctly for writeback. If we don't get write faults after the page is mlock()d, then we need the ->page_mkwrite() call during the mlock() call. > If filesystem people's feedback is that they really want mlock() to > continue pre-allocating blocks, maybe we can just do it using > fallocate() rather than page_mkwrite() callbacks ? Fallocate is not good enough to avoid ->page_mkwrite callbacks. Indeed, XFS (at least) requires the ->page_mkwrite() callout even on preallocated space to correctly mark the buffers as unwritten so extent conversion in ->writepage is triggered correctly (see test #166 in xfstests). Hence I think that avoiding ->page_mkwrite callouts is likely to break some filesystems in subtle, undetected ways. IMO, regardless of what is done, it would be really good to start by writing a new regression test to exercise and encode the expected the mlock behaviour so we can detect regressions later on.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-17 23:11 ` Dave Chinner 0 siblings, 0 replies; 56+ messages in thread From: Dave Chinner @ 2010-11-17 23:11 UTC (permalink / raw) To: Michel Lespinasse Cc: Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 02:05:30PM -0800, Michel Lespinasse wrote: > On Wed, Nov 17, 2010 at 7:28 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote: > >> On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: > >> > When faulting in pages for mlock(), we want to break COW for anonymous > >> > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is > >> > no need to write-fault into VM_SHARED vmas since shared file pages can > >> > be mlocked first and dirtied later, when/if they actually get written to. > >> > Skipping the write fault is desirable, as we don't want to unnecessarily > >> > cause these pages to be dirtied and queued for writeback. > >> > >> It's not just to break COW, but to do block allocation and such > >> (filesystem's page_mkwrite op). That needs to at least be explained > >> in the changelog. > > > > Agreed, the 0/3 description actually does mention this. > > > >> Filesystem doesn't have a good way to fully pin required things > >> according to mlock, but page_mkwrite provides some reasonable things > >> (like block allocation / reservation). > > > > Right, but marking all pages dirty isn't really sane. I can imagine > > making the reservation but not marking things dirty solution, although > > it might be lots harder to implement, esp since some filesystems don't > > actually have a page_mkwrite() implementation. > > Really, my understanding is that not pre-allocating filesystem blocks > is just fine. This is, after all, what happens with ext3 and it's > never been reported as a bug (that I know of). It's not ext3 you have to worry about - it's the filesystems that need special state set up on their pages/buffers for ->writepage to work correctly that are the problem. You need to call ->write_begin/->write_end to get the state set up properly. If this state is not set up properly, silent data loss will occur during mmap writes either by ENOSPC or failing to set up writes into unwritten extents correctly (i.e. we'll be back to where we were in 2.6.15). I don't think ->page_mkwrite can be worked around - we need that to be called on the first write fault of any mmap()d page to ensure it is set up correctly for writeback. If we don't get write faults after the page is mlock()d, then we need the ->page_mkwrite() call during the mlock() call. > If filesystem people's feedback is that they really want mlock() to > continue pre-allocating blocks, maybe we can just do it using > fallocate() rather than page_mkwrite() callbacks ? Fallocate is not good enough to avoid ->page_mkwrite callbacks. Indeed, XFS (at least) requires the ->page_mkwrite() callout even on preallocated space to correctly mark the buffers as unwritten so extent conversion in ->writepage is triggered correctly (see test #166 in xfstests). Hence I think that avoiding ->page_mkwrite callouts is likely to break some filesystems in subtle, undetected ways. IMO, regardless of what is done, it would be really good to start by writing a new regression test to exercise and encode the expected the mlock behaviour so we can detect regressions later on.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 23:11 ` Dave Chinner @ 2010-11-17 23:31 ` Michel Lespinasse -1 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 23:31 UTC (permalink / raw) To: Dave Chinner Cc: Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 3:11 PM, Dave Chinner <david@fromorbit.com> wrote: >> Really, my understanding is that not pre-allocating filesystem blocks >> is just fine. This is, after all, what happens with ext3 and it's >> never been reported as a bug (that I know of). > > It's not ext3 you have to worry about - it's the filesystems that > need special state set up on their pages/buffers for ->writepage to > work correctly that are the problem. You need to call > ->write_begin/->write_end to get the state set up properly. > > If this state is not set up properly, silent data loss will occur > during mmap writes either by ENOSPC or failing to set up writes into > unwritten extents correctly (i.e. we'll be back to where we were in > 2.6.15). > > I don't think ->page_mkwrite can be worked around - we need that to > be called on the first write fault of any mmap()d page to ensure it > is set up correctly for writeback. If we don't get write faults > after the page is mlock()d, then we need the ->page_mkwrite() call > during the mlock() call. Just to be clear - I'm proposing to skip the entire do_wp_page() call by doing a read fault rather than a write fault. If the page wasn't dirty already, it will stay clean and with a non-writable PTE until it gets actually written to, at which point we'll get a write fault and do_wp_page will be invoked as usual. I am not proposing to skip the page_mkwrite() while upgrading the PTE permissions, which I think is what you were arguing against ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-17 23:31 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-17 23:31 UTC (permalink / raw) To: Dave Chinner Cc: Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 3:11 PM, Dave Chinner <david@fromorbit.com> wrote: >> Really, my understanding is that not pre-allocating filesystem blocks >> is just fine. This is, after all, what happens with ext3 and it's >> never been reported as a bug (that I know of). > > It's not ext3 you have to worry about - it's the filesystems that > need special state set up on their pages/buffers for ->writepage to > work correctly that are the problem. You need to call > ->write_begin/->write_end to get the state set up properly. > > If this state is not set up properly, silent data loss will occur > during mmap writes either by ENOSPC or failing to set up writes into > unwritten extents correctly (i.e. we'll be back to where we were in > 2.6.15). > > I don't think ->page_mkwrite can be worked around - we need that to > be called on the first write fault of any mmap()d page to ensure it > is set up correctly for writeback. If we don't get write faults > after the page is mlock()d, then we need the ->page_mkwrite() call > during the mlock() call. Just to be clear - I'm proposing to skip the entire do_wp_page() call by doing a read fault rather than a write fault. If the page wasn't dirty already, it will stay clean and with a non-writable PTE until it gets actually written to, at which point we'll get a write fault and do_wp_page will be invoked as usual. I am not proposing to skip the page_mkwrite() while upgrading the PTE permissions, which I think is what you were arguing against ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 23:31 ` Michel Lespinasse @ 2010-11-19 1:46 ` Dave Chinner -1 siblings, 0 replies; 56+ messages in thread From: Dave Chinner @ 2010-11-19 1:46 UTC (permalink / raw) To: Michel Lespinasse Cc: Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 03:31:37PM -0800, Michel Lespinasse wrote: > On Wed, Nov 17, 2010 at 3:11 PM, Dave Chinner <david@fromorbit.com> wrote: > >> Really, my understanding is that not pre-allocating filesystem blocks > >> is just fine. This is, after all, what happens with ext3 and it's > >> never been reported as a bug (that I know of). > > > > It's not ext3 you have to worry about - it's the filesystems that > > need special state set up on their pages/buffers for ->writepage to > > work correctly that are the problem. You need to call > > ->write_begin/->write_end to get the state set up properly. > > > > If this state is not set up properly, silent data loss will occur > > during mmap writes either by ENOSPC or failing to set up writes into > > unwritten extents correctly (i.e. we'll be back to where we were in > > 2.6.15). > > > > I don't think ->page_mkwrite can be worked around - we need that to > > be called on the first write fault of any mmap()d page to ensure it > > is set up correctly for writeback. If we don't get write faults > > after the page is mlock()d, then we need the ->page_mkwrite() call > > during the mlock() call. > > Just to be clear - I'm proposing to skip the entire do_wp_page() call > by doing a read fault rather than a write fault. If the page wasn't > dirty already, it will stay clean and with a non-writable PTE until it > gets actually written to, at which point we'll get a write fault and > do_wp_page will be invoked as usual. I have no problem with that - I'm surprised that mlock didn't work that way in the first place. > I am not proposing to skip the page_mkwrite() while upgrading the PTE > permissions, which I think is what you were arguing against ? I wasn't arguing against anything, merely pointing out that the ->page_mkwrite call is aboslutely necessary. You've made clarified that it still occurs, so I'm happy... FWIW, what I was responding to was the assumption that "this is alright for ext3, so it must be OK" extrapolation about ->page_mkwrite behaviour. Especially as ext3 does not even implement the ->page_mkwrite operation (which means mmap writes into holes can't detect ENOSPC correctly)... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-19 1:46 ` Dave Chinner 0 siblings, 0 replies; 56+ messages in thread From: Dave Chinner @ 2010-11-19 1:46 UTC (permalink / raw) To: Michel Lespinasse Cc: Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 03:31:37PM -0800, Michel Lespinasse wrote: > On Wed, Nov 17, 2010 at 3:11 PM, Dave Chinner <david@fromorbit.com> wrote: > >> Really, my understanding is that not pre-allocating filesystem blocks > >> is just fine. This is, after all, what happens with ext3 and it's > >> never been reported as a bug (that I know of). > > > > It's not ext3 you have to worry about - it's the filesystems that > > need special state set up on their pages/buffers for ->writepage to > > work correctly that are the problem. You need to call > > ->write_begin/->write_end to get the state set up properly. > > > > If this state is not set up properly, silent data loss will occur > > during mmap writes either by ENOSPC or failing to set up writes into > > unwritten extents correctly (i.e. we'll be back to where we were in > > 2.6.15). > > > > I don't think ->page_mkwrite can be worked around - we need that to > > be called on the first write fault of any mmap()d page to ensure it > > is set up correctly for writeback. If we don't get write faults > > after the page is mlock()d, then we need the ->page_mkwrite() call > > during the mlock() call. > > Just to be clear - I'm proposing to skip the entire do_wp_page() call > by doing a read fault rather than a write fault. If the page wasn't > dirty already, it will stay clean and with a non-writable PTE until it > gets actually written to, at which point we'll get a write fault and > do_wp_page will be invoked as usual. I have no problem with that - I'm surprised that mlock didn't work that way in the first place. > I am not proposing to skip the page_mkwrite() while upgrading the PTE > permissions, which I think is what you were arguing against ? I wasn't arguing against anything, merely pointing out that the ->page_mkwrite call is aboslutely necessary. You've made clarified that it still occurs, so I'm happy... FWIW, what I was responding to was the assumption that "this is alright for ext3, so it must be OK" extrapolation about ->page_mkwrite behaviour. Especially as ext3 does not even implement the ->page_mkwrite operation (which means mmap writes into holes can't detect ENOSPC correctly)... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 23:11 ` Dave Chinner @ 2010-11-17 23:52 ` Ted Ts'o -1 siblings, 0 replies; 56+ messages in thread From: Ted Ts'o @ 2010-11-17 23:52 UTC (permalink / raw) To: Dave Chinner Cc: Michel Lespinasse, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > I don't think ->page_mkwrite can be worked around - we need that to > be called on the first write fault of any mmap()d page to ensure it > is set up correctly for writeback. If we don't get write faults > after the page is mlock()d, then we need the ->page_mkwrite() call > during the mlock() call. OK, so I'm not an mm hacker, so maybe I'm missing something. Could part of this be fixed by simply sending the write faults for mlock()'ed pages, so page_mkwrite() gets called when the page is dirtied. Seems like a real waste to have the file system pre-allocate all of the blocks for a mlock()'ed region. Why does mlock() have to result in the write faults getting suppressed when the page is actually dirtied? - Ted ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-17 23:52 ` Ted Ts'o 0 siblings, 0 replies; 56+ messages in thread From: Ted Ts'o @ 2010-11-17 23:52 UTC (permalink / raw) To: Dave Chinner Cc: Michel Lespinasse, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > I don't think ->page_mkwrite can be worked around - we need that to > be called on the first write fault of any mmap()d page to ensure it > is set up correctly for writeback. If we don't get write faults > after the page is mlock()d, then we need the ->page_mkwrite() call > during the mlock() call. OK, so I'm not an mm hacker, so maybe I'm missing something. Could part of this be fixed by simply sending the write faults for mlock()'ed pages, so page_mkwrite() gets called when the page is dirtied. Seems like a real waste to have the file system pre-allocate all of the blocks for a mlock()'ed region. Why does mlock() have to result in the write faults getting suppressed when the page is actually dirtied? - Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 23:52 ` Ted Ts'o @ 2010-11-18 0:53 ` Andrew Morton -1 siblings, 0 replies; 56+ messages in thread From: Andrew Morton @ 2010-11-18 0:53 UTC (permalink / raw) To: Ted Ts'o Cc: Dave Chinner, Michel Lespinasse, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, 17 Nov 2010 18:52:30 -0500 "Ted Ts'o" <tytso@mit.edu> wrote: > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > I don't think ->page_mkwrite can be worked around - we need that to > > be called on the first write fault of any mmap()d page to ensure it > > is set up correctly for writeback. If we don't get write faults > > after the page is mlock()d, then we need the ->page_mkwrite() call > > during the mlock() call. > > OK, so I'm not an mm hacker, so maybe I'm missing something. Could > part of this be fixed by simply sending the write faults for > mlock()'ed pages, so page_mkwrite() gets called when the page is > dirtied. Seems like a real waste to have the file system pre-allocate > all of the blocks for a mlock()'ed region. Why does mlock() have to > result in the write faults getting suppressed when the page is > actually dirtied? Yup, I don't think it would be too bad to take a minor fault each time an mlocked page transitions from clean->dirty. In fact we should already be doing that, after the mlocked page gets written back by kupdate? Hope so! ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-18 0:53 ` Andrew Morton 0 siblings, 0 replies; 56+ messages in thread From: Andrew Morton @ 2010-11-18 0:53 UTC (permalink / raw) To: Ted Ts'o Cc: Dave Chinner, Michel Lespinasse, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, 17 Nov 2010 18:52:30 -0500 "Ted Ts'o" <tytso@mit.edu> wrote: > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > I don't think ->page_mkwrite can be worked around - we need that to > > be called on the first write fault of any mmap()d page to ensure it > > is set up correctly for writeback. If we don't get write faults > > after the page is mlock()d, then we need the ->page_mkwrite() call > > during the mlock() call. > > OK, so I'm not an mm hacker, so maybe I'm missing something. Could > part of this be fixed by simply sending the write faults for > mlock()'ed pages, so page_mkwrite() gets called when the page is > dirtied. Seems like a real waste to have the file system pre-allocate > all of the blocks for a mlock()'ed region. Why does mlock() have to > result in the write faults getting suppressed when the page is > actually dirtied? Yup, I don't think it would be too bad to take a minor fault each time an mlocked page transitions from clean->dirty. In fact we should already be doing that, after the mlocked page gets written back by kupdate? Hope so! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-18 0:53 ` Andrew Morton @ 2010-11-18 11:03 ` Michel Lespinasse -1 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-18 11:03 UTC (permalink / raw) To: Andrew Morton Cc: Ted Ts'o, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 04:53:09PM -0800, Andrew Morton wrote: > On Wed, 17 Nov 2010 18:52:30 -0500 > "Ted Ts'o" <tytso@mit.edu> wrote: > > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > > I don't think ->page_mkwrite can be worked around - we need that to > > > be called on the first write fault of any mmap()d page to ensure it > > > is set up correctly for writeback. If we don't get write faults > > > after the page is mlock()d, then we need the ->page_mkwrite() call > > > during the mlock() call. > > > > OK, so I'm not an mm hacker, so maybe I'm missing something. Could > > part of this be fixed by simply sending the write faults for > > mlock()'ed pages, so page_mkwrite() gets called when the page is > > dirtied. Seems like a real waste to have the file system pre-allocate > > all of the blocks for a mlock()'ed region. Why does mlock() have to > > result in the write faults getting suppressed when the page is > > actually dirtied? This is actually what the patch does - by having mlock() use a read fault, pages are loaded in memory and mlocked, but the ptes are not marked as writable so that a later write access will be caught as a write fault at that time (with all the usual dirtying and page_mkwrite() callbacks). > Yup, I don't think it would be too bad to take a minor fault each time > an mlocked page transitions from clean->dirty. > > In fact we should already be doing that, after the mlocked page gets > written back by kupdate? Hope so! Yes, handle_mm_fault() is careful to never create writable ptes pointing to clean file pages, so that a later write fault will correctly dirty the corresponding page. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-18 11:03 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-18 11:03 UTC (permalink / raw) To: Andrew Morton Cc: Ted Ts'o, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 04:53:09PM -0800, Andrew Morton wrote: > On Wed, 17 Nov 2010 18:52:30 -0500 > "Ted Ts'o" <tytso@mit.edu> wrote: > > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > > I don't think ->page_mkwrite can be worked around - we need that to > > > be called on the first write fault of any mmap()d page to ensure it > > > is set up correctly for writeback. If we don't get write faults > > > after the page is mlock()d, then we need the ->page_mkwrite() call > > > during the mlock() call. > > > > OK, so I'm not an mm hacker, so maybe I'm missing something. Could > > part of this be fixed by simply sending the write faults for > > mlock()'ed pages, so page_mkwrite() gets called when the page is > > dirtied. Seems like a real waste to have the file system pre-allocate > > all of the blocks for a mlock()'ed region. Why does mlock() have to > > result in the write faults getting suppressed when the page is > > actually dirtied? This is actually what the patch does - by having mlock() use a read fault, pages are loaded in memory and mlocked, but the ptes are not marked as writable so that a later write access will be caught as a write fault at that time (with all the usual dirtying and page_mkwrite() callbacks). > Yup, I don't think it would be too bad to take a minor fault each time > an mlocked page transitions from clean->dirty. > > In fact we should already be doing that, after the mlocked page gets > written back by kupdate? Hope so! Yes, handle_mm_fault() is careful to never create writable ptes pointing to clean file pages, so that a later write fault will correctly dirty the corresponding page. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 23:11 ` Dave Chinner @ 2010-11-18 13:37 ` Christoph Hellwig -1 siblings, 0 replies; 56+ messages in thread From: Christoph Hellwig @ 2010-11-18 13:37 UTC (permalink / raw) To: Dave Chinner Cc: Michel Lespinasse, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > Hence I think that avoiding ->page_mkwrite callouts is likely to > break some filesystems in subtle, undetected ways. IMO, regardless > of what is done, it would be really good to start by writing a new > regression test to exercise and encode the expected the mlock > behaviour so we can detect regressions later on.... I think it would help if we could drink a bit of the test driven design coolaid here. Michel, can you write some testcases where pages on a shared mapping are mlocked, then dirtied and then munlocked, and then written out using msync/fsync. Anything that fails this test on btrfs/ext4/gfs/xfs/etc obviously doesn't work. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-18 13:37 ` Christoph Hellwig 0 siblings, 0 replies; 56+ messages in thread From: Christoph Hellwig @ 2010-11-18 13:37 UTC (permalink / raw) To: Dave Chinner Cc: Michel Lespinasse, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > Hence I think that avoiding ->page_mkwrite callouts is likely to > break some filesystems in subtle, undetected ways. IMO, regardless > of what is done, it would be really good to start by writing a new > regression test to exercise and encode the expected the mlock > behaviour so we can detect regressions later on.... I think it would help if we could drink a bit of the test driven design coolaid here. Michel, can you write some testcases where pages on a shared mapping are mlocked, then dirtied and then munlocked, and then written out using msync/fsync. Anything that fails this test on btrfs/ext4/gfs/xfs/etc obviously doesn't work. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-18 13:37 ` Christoph Hellwig @ 2010-11-18 17:41 ` Hugh Dickins -1 siblings, 0 replies; 56+ messages in thread From: Hugh Dickins @ 2010-11-18 17:41 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Michel Lespinasse, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, 18 Nov 2010, Christoph Hellwig wrote: > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > Hence I think that avoiding ->page_mkwrite callouts is likely to > > break some filesystems in subtle, undetected ways. IMO, regardless > > of what is done, it would be really good to start by writing a new > > regression test to exercise and encode the expected the mlock > > behaviour so we can detect regressions later on.... > > I think it would help if we could drink a bit of the test driven design > coolaid here. Michel, can you write some testcases where pages on a > shared mapping are mlocked, then dirtied and then munlocked, and then > written out using msync/fsync. Anything that fails this test on > btrfs/ext4/gfs/xfs/etc obviously doesn't work. Whilst it's hard to argue against a request for testing, Dave's worries just sprang from a misunderstanding of all the talk about "avoiding -> page_mkwrite". There's nothing strange or risky about Michel's patch, it does not avoid ->page_mkwrite when there is a write: it just stops pretending that there was a write when locking down the shared area. Hugh ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-18 17:41 ` Hugh Dickins 0 siblings, 0 replies; 56+ messages in thread From: Hugh Dickins @ 2010-11-18 17:41 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Michel Lespinasse, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, 18 Nov 2010, Christoph Hellwig wrote: > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > Hence I think that avoiding ->page_mkwrite callouts is likely to > > break some filesystems in subtle, undetected ways. IMO, regardless > > of what is done, it would be really good to start by writing a new > > regression test to exercise and encode the expected the mlock > > behaviour so we can detect regressions later on.... > > I think it would help if we could drink a bit of the test driven design > coolaid here. Michel, can you write some testcases where pages on a > shared mapping are mlocked, then dirtied and then munlocked, and then > written out using msync/fsync. Anything that fails this test on > btrfs/ext4/gfs/xfs/etc obviously doesn't work. Whilst it's hard to argue against a request for testing, Dave's worries just sprang from a misunderstanding of all the talk about "avoiding -> page_mkwrite". There's nothing strange or risky about Michel's patch, it does not avoid ->page_mkwrite when there is a write: it just stops pretending that there was a write when locking down the shared area. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-18 17:41 ` Hugh Dickins @ 2010-11-19 7:23 ` Michel Lespinasse -1 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-19 7:23 UTC (permalink / raw) To: Hugh Dickins Cc: Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote: > On Thu, 18 Nov 2010, Christoph Hellwig wrote: > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > > Hence I think that avoiding ->page_mkwrite callouts is likely to > > > break some filesystems in subtle, undetected ways. IMO, regardless > > > of what is done, it would be really good to start by writing a new > > > regression test to exercise and encode the expected the mlock > > > behaviour so we can detect regressions later on.... > > > > I think it would help if we could drink a bit of the test driven design > > coolaid here. Michel, can you write some testcases where pages on a > > shared mapping are mlocked, then dirtied and then munlocked, and then > > written out using msync/fsync. Anything that fails this test on > > btrfs/ext4/gfs/xfs/etc obviously doesn't work. I think it's still under debate what's an acceptable result for this test (i.e. what's supposed to happen during mlock of a shared mapping of a sparsely allocated file - is a fallocate equivalent supposed to happen ?) But I agree discussing based on test results will make things more concrete. > Whilst it's hard to argue against a request for testing, Dave's worries > just sprang from a misunderstanding of all the talk about "avoiding -> > page_mkwrite". There's nothing strange or risky about Michel's patch, > it does not avoid ->page_mkwrite when there is a write: it just stops > pretending that there was a write when locking down the shared area. So, I decided to test this using memtoy. /data is a separate partition where I had just 10GB free space, and /data/hole20G was created using dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0. memtoy>file /data/hole20G shared memtoy>map hole20G At this point the file is mapped using a writable, shared VMA. memtoy>touch hole20G memtoy: touched 5242880 pages in 30.595 secs At this point memtoy's address space is populated with zeroed pages. The pages are distinct (meminfo does show 20G of allocated pages), are classified as file pages, not anon, and are associated with the struct address_space for /data/hole20G. That file still does not have blocks allocated, as can be seen with du /data/hole20G. memtoy>lock hole20G memtoy tries to mlock the hole20G VMA. This is where things get interesting. Using ext2, without my patch (ext3 should be similar): - first, mlock does fast progress going though file pages, marking them as dirty. Eventually, it hits the dirty limit and gets throttled. - then, mlock does slow progress as it needs to wait for writeback. writeback occurs and allocates blocks for the /data/hole20G. Eventually, the /data partition gets full. - then, mlock does no progress as it's at the dirty limit and nothing gets written back. - mlock never terminates. Using ext2, with my patch (ext3 should be similar): - mlock goes through all pages in ~5 seconds, marking them as mlocked (but still not dirty) - mlock completes. /data/hole20G still does not have blocks allocated. - if memtoy is then instructed to dirty all the pages (using 'touch hole20G write'): - first, memtoy does fast progress faulting through file pages, marking them as dirty. Eventually, it hits the dirty limit and gets throttled. - then, memtoy does slow progress as it needs to wait for writeback. writeback occurs and allocates blocks for the /data/hole20G. Eventually, the /data partition gets full. - then, memtoy does no progress as it's at the dirty limit and nothing gets written back. It gets stuck into a write fault that never completes. - i.e. this is essentially the same lockup as without my patch, except that it occurs when the application tries to dirty the shared file rather than during mlock itself. Using ext4, without my patch: - first, mlock does fast progress going though file pages, marking them as dirty. Eventually, it hits the dirty limit and gets throttled. - then, mlock does slow progress as it needs to wait for writeback. writeback occurs and allocates blocks for the /data/hole20G. Eventually, the /data partition gets full. - then, mlock returns an error. Using ext4, with my patch: - mlock goes through all pages in ~5 seconds, marking them as mlocked (but still not dirty) - mlock completes. /data/hole20G still does not have blocks allocated. - if memtoy is then instructed to dirty all the pages (using 'touch hole20G write'): - first, memtoy does fast progress faulting through file pages, marking them as dirty. Eventually, it hits the dirty limit and gets throttled. - then, memtoy does slow progress as it needs to wait for writeback. writeback occurs and allocates blocks for the /data/hole20G. Eventually, the /data partition gets full. - at that point, memtoy dies of SIGBUS. - i.e. for filesystems that define the page_mkwrite callback, the mlock behavior when running out of space writing to sparse files is clearly nicer without my patch than with it. Not 100% sure what to make of these results. Approaching the problem the other way - would there be any objection to adding code to do an fallocate() equivalent at the start of mlock ? This would be a no-op when the file is fully allocated on disk, and would allow mlock to return an error if the file can't get fully allocated (no idea what errno should be for such case, though). -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-19 7:23 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-19 7:23 UTC (permalink / raw) To: Hugh Dickins Cc: Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote: > On Thu, 18 Nov 2010, Christoph Hellwig wrote: > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > > Hence I think that avoiding ->page_mkwrite callouts is likely to > > > break some filesystems in subtle, undetected ways. IMO, regardless > > > of what is done, it would be really good to start by writing a new > > > regression test to exercise and encode the expected the mlock > > > behaviour so we can detect regressions later on.... > > > > I think it would help if we could drink a bit of the test driven design > > coolaid here. Michel, can you write some testcases where pages on a > > shared mapping are mlocked, then dirtied and then munlocked, and then > > written out using msync/fsync. Anything that fails this test on > > btrfs/ext4/gfs/xfs/etc obviously doesn't work. I think it's still under debate what's an acceptable result for this test (i.e. what's supposed to happen during mlock of a shared mapping of a sparsely allocated file - is a fallocate equivalent supposed to happen ?) But I agree discussing based on test results will make things more concrete. > Whilst it's hard to argue against a request for testing, Dave's worries > just sprang from a misunderstanding of all the talk about "avoiding -> > page_mkwrite". There's nothing strange or risky about Michel's patch, > it does not avoid ->page_mkwrite when there is a write: it just stops > pretending that there was a write when locking down the shared area. So, I decided to test this using memtoy. /data is a separate partition where I had just 10GB free space, and /data/hole20G was created using dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0. memtoy>file /data/hole20G shared memtoy>map hole20G At this point the file is mapped using a writable, shared VMA. memtoy>touch hole20G memtoy: touched 5242880 pages in 30.595 secs At this point memtoy's address space is populated with zeroed pages. The pages are distinct (meminfo does show 20G of allocated pages), are classified as file pages, not anon, and are associated with the struct address_space for /data/hole20G. That file still does not have blocks allocated, as can be seen with du /data/hole20G. memtoy>lock hole20G memtoy tries to mlock the hole20G VMA. This is where things get interesting. Using ext2, without my patch (ext3 should be similar): - first, mlock does fast progress going though file pages, marking them as dirty. Eventually, it hits the dirty limit and gets throttled. - then, mlock does slow progress as it needs to wait for writeback. writeback occurs and allocates blocks for the /data/hole20G. Eventually, the /data partition gets full. - then, mlock does no progress as it's at the dirty limit and nothing gets written back. - mlock never terminates. Using ext2, with my patch (ext3 should be similar): - mlock goes through all pages in ~5 seconds, marking them as mlocked (but still not dirty) - mlock completes. /data/hole20G still does not have blocks allocated. - if memtoy is then instructed to dirty all the pages (using 'touch hole20G write'): - first, memtoy does fast progress faulting through file pages, marking them as dirty. Eventually, it hits the dirty limit and gets throttled. - then, memtoy does slow progress as it needs to wait for writeback. writeback occurs and allocates blocks for the /data/hole20G. Eventually, the /data partition gets full. - then, memtoy does no progress as it's at the dirty limit and nothing gets written back. It gets stuck into a write fault that never completes. - i.e. this is essentially the same lockup as without my patch, except that it occurs when the application tries to dirty the shared file rather than during mlock itself. Using ext4, without my patch: - first, mlock does fast progress going though file pages, marking them as dirty. Eventually, it hits the dirty limit and gets throttled. - then, mlock does slow progress as it needs to wait for writeback. writeback occurs and allocates blocks for the /data/hole20G. Eventually, the /data partition gets full. - then, mlock returns an error. Using ext4, with my patch: - mlock goes through all pages in ~5 seconds, marking them as mlocked (but still not dirty) - mlock completes. /data/hole20G still does not have blocks allocated. - if memtoy is then instructed to dirty all the pages (using 'touch hole20G write'): - first, memtoy does fast progress faulting through file pages, marking them as dirty. Eventually, it hits the dirty limit and gets throttled. - then, memtoy does slow progress as it needs to wait for writeback. writeback occurs and allocates blocks for the /data/hole20G. Eventually, the /data partition gets full. - at that point, memtoy dies of SIGBUS. - i.e. for filesystems that define the page_mkwrite callback, the mlock behavior when running out of space writing to sparse files is clearly nicer without my patch than with it. Not 100% sure what to make of these results. Approaching the problem the other way - would there be any objection to adding code to do an fallocate() equivalent at the start of mlock ? This would be a no-op when the file is fully allocated on disk, and would allow mlock to return an error if the file can't get fully allocated (no idea what errno should be for such case, though). -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-19 7:23 ` Michel Lespinasse (?) @ 2010-11-19 13:38 ` Theodore Tso -1 siblings, 0 replies; 56+ messages in thread From: Theodore Tso @ 2010-11-19 13:38 UTC (permalink / raw) To: Michel Lespinasse Cc: Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Michael Rubin, Suleiman Souhlal [-- Attachment #1: Type: text/plain, Size: 1269 bytes --] On Fri, Nov 19, 2010 at 2:23 AM, Michel Lespinasse <walken@google.com>wrote: > Approaching the problem the other way - would there be any objection to > adding code to do an fallocate() equivalent at the start of mlock ? > This would be a no-op when the file is fully allocated on disk, and would > allow mlock to return an error if the file can't get fully allocated > (no idea what errno should be for such case, though). > My vote would be against. If you if you mmap a sparse file and then try writing to it willy-nilly, bad things will happen. This is true without a mlock(). Where is it written that mlock() has anything to do with improving this situation? If userspace wants to call fallocate() before it calls mlock(), it should do that. And in fact, in most cases, userspace should be encouraged to do that. But having mlock() call fallocate() and then return ENOSPC if there's no room? That just makes me feel icky as all heck. Look, it was an accident / bug of the implementation that mlock() magically dirtied all these pages. It might have made some situations better, but I very much doubt applications depended upon it, and I'd really rather not perpetuate this particular magic side effect of the buggy implementation of mlock(). -- Ted [-- Attachment #2: Type: text/html, Size: 1669 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-19 7:23 ` Michel Lespinasse @ 2010-11-19 13:42 ` Theodore Tso -1 siblings, 0 replies; 56+ messages in thread From: Theodore Tso @ 2010-11-19 13:42 UTC (permalink / raw) To: Michel Lespinasse Cc: Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Michael Rubin, Suleiman Souhlal On Fri, Nov 19, 2010 at 2:23 AM, Michel Lespinasse <walken@google.com> wrote: > > Approaching the problem the other way - would there be any objection to > adding code to do an fallocate() equivalent at the start of mlock ? > This would be a no-op when the file is fully allocated on disk, and would > allow mlock to return an error if the file can't get fully allocated > (no idea what errno should be for such case, though). My vote would be against. If you if you mmap a sparse file and then try writing to it willy-nilly, bad things will happen. This is true without a mlock(). Where is it written that mlock() has anything to do with improving this situation? If userspace wants to call fallocate() before it calls mlock(), it should do that. And in fact, in most cases, userspace should probably be encouraged to do that. But having mlock() call fallocate() and then return ENOSPC if there's no room? Isn't it confusing that mlock() call ENOSPC? Doesn't that give you cognitive dissonance? It should because fundamentally mlock() has nothing to do with block allocation!! Read the API spec! Look, it was an accident / bug of the implementation that mlock() magically dirtied all these pages. It might have made some situations better, but I very much doubt applications depended upon it, and I'd really rather not perpetuate this particular magic side effect of the previously buggy implementation of mlock(). -- Ted ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-19 13:42 ` Theodore Tso 0 siblings, 0 replies; 56+ messages in thread From: Theodore Tso @ 2010-11-19 13:42 UTC (permalink / raw) To: Michel Lespinasse Cc: Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Michael Rubin, Suleiman Souhlal On Fri, Nov 19, 2010 at 2:23 AM, Michel Lespinasse <walken@google.com> wrote: > > Approaching the problem the other way - would there be any objection to > adding code to do an fallocate() equivalent at the start of mlock ? > This would be a no-op when the file is fully allocated on disk, and would > allow mlock to return an error if the file can't get fully allocated > (no idea what errno should be for such case, though). My vote would be against. If you if you mmap a sparse file and then try writing to it willy-nilly, bad things will happen. This is true without a mlock(). Where is it written that mlock() has anything to do with improving this situation? If userspace wants to call fallocate() before it calls mlock(), it should do that. And in fact, in most cases, userspace should probably be encouraged to do that. But having mlock() call fallocate() and then return ENOSPC if there's no room? Isn't it confusing that mlock() call ENOSPC? Doesn't that give you cognitive dissonance? It should because fundamentally mlock() has nothing to do with block allocation!! Read the API spec! Look, it was an accident / bug of the implementation that mlock() magically dirtied all these pages. It might have made some situations better, but I very much doubt applications depended upon it, and I'd really rather not perpetuate this particular magic side effect of the previously buggy implementation of mlock(). -- Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-19 13:42 ` Theodore Tso @ 2010-11-19 15:06 ` Christoph Hellwig -1 siblings, 0 replies; 56+ messages in thread From: Christoph Hellwig @ 2010-11-19 15:06 UTC (permalink / raw) To: Theodore Tso Cc: Michel Lespinasse, Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Michael Rubin, Suleiman Souhlal On Fri, Nov 19, 2010 at 08:42:05AM -0500, Theodore Tso wrote: > My vote would be against. ? If you if you mmap a sparse file and then > try writing to it willy-nilly, bad things will happen. ?This is true without > a mlock(). ? Where is it written that mlock() has anything to do with > improving this situation? Exactly. Allocating space has been a side-effect on a handfull filesystem for about 20 kernel releases. > If userspace wants to call fallocate() before it calls mlock(), it should > do that. ?And in fact, in most cases, userspace should probably be > encouraged to do that. ? But having mlock() call fallocate() and > then return ENOSPC if there's no room? Isn't it confusing that mlock() > call ENOSPC? Doesn't that give you cognitive dissonance? It should > because fundamentally mlock() has nothing to do with block allocation!! > Read the API spec! Indeed. There is no need to make mlock + flag a parallel-API to fallocate. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-19 15:06 ` Christoph Hellwig 0 siblings, 0 replies; 56+ messages in thread From: Christoph Hellwig @ 2010-11-19 15:06 UTC (permalink / raw) To: Theodore Tso Cc: Michel Lespinasse, Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Michael Rubin, Suleiman Souhlal On Fri, Nov 19, 2010 at 08:42:05AM -0500, Theodore Tso wrote: > My vote would be against. ? If you if you mmap a sparse file and then > try writing to it willy-nilly, bad things will happen. ?This is true without > a mlock(). ? Where is it written that mlock() has anything to do with > improving this situation? Exactly. Allocating space has been a side-effect on a handfull filesystem for about 20 kernel releases. > If userspace wants to call fallocate() before it calls mlock(), it should > do that. ?And in fact, in most cases, userspace should probably be > encouraged to do that. ? But having mlock() call fallocate() and > then return ENOSPC if there's no room? Isn't it confusing that mlock() > call ENOSPC? Doesn't that give you cognitive dissonance? It should > because fundamentally mlock() has nothing to do with block allocation!! > Read the API spec! Indeed. There is no need to make mlock + flag a parallel-API to fallocate. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-19 7:23 ` Michel Lespinasse @ 2010-11-19 22:54 ` Andrew Morton -1 siblings, 0 replies; 56+ messages in thread From: Andrew Morton @ 2010-11-19 22:54 UTC (permalink / raw) To: Michel Lespinasse Cc: Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, 18 Nov 2010 23:23:16 -0800 Michel Lespinasse <walken@google.com> wrote: > On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote: > > On Thu, 18 Nov 2010, Christoph Hellwig wrote: > > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > > > Hence I think that avoiding ->page_mkwrite callouts is likely to > > > > break some filesystems in subtle, undetected ways. IMO, regardless > > > > of what is done, it would be really good to start by writing a new > > > > regression test to exercise and encode the expected the mlock > > > > behaviour so we can detect regressions later on.... > > > > > > I think it would help if we could drink a bit of the test driven design > > > coolaid here. Michel, can you write some testcases where pages on a > > > shared mapping are mlocked, then dirtied and then munlocked, and then > > > written out using msync/fsync. Anything that fails this test on > > > btrfs/ext4/gfs/xfs/etc obviously doesn't work. > > I think it's still under debate what's an acceptable result for this test > (i.e. what's supposed to happen during mlock of a shared mapping of > a sparsely allocated file - is a fallocate equivalent supposed to happen ?) > But I agree discussing based on test results will make things more concrete. > > > Whilst it's hard to argue against a request for testing, Dave's worries > > just sprang from a misunderstanding of all the talk about "avoiding -> > > page_mkwrite". There's nothing strange or risky about Michel's patch, > > it does not avoid ->page_mkwrite when there is a write: it just stops > > pretending that there was a write when locking down the shared area. > > So, I decided to test this using memtoy. Wait. You *tested* the kernel? I dunno, kids these days... > /data is a separate partition > where I had just 10GB free space, and /data/hole20G was created using > dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0. > > memtoy>file /data/hole20G shared > memtoy>map hole20G > > At this point the file is mapped using a writable, shared VMA. > > memtoy>touch hole20G > memtoy: touched 5242880 pages in 30.595 secs > > At this point memtoy's address space is populated with zeroed > pages. The pages are distinct (meminfo does show 20G of allocated pages), > are classified as file pages, not anon, and are associated with the > struct address_space for /data/hole20G. That file still does not have > blocks allocated, as can be seen with du /data/hole20G. > > memtoy>lock hole20G > > memtoy tries to mlock the hole20G VMA. > This is where things get interesting. > > Using ext2, without my patch (ext3 should be similar): > - first, mlock does fast progress going though file pages, marking them > as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, mlock does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, mlock does no progress as it's at the dirty limit and nothing > gets written back. > - mlock never terminates. > > Using ext2, with my patch (ext3 should be similar): > - mlock goes through all pages in ~5 seconds, marking them as mlocked > (but still not dirty) > - mlock completes. /data/hole20G still does not have blocks allocated. > - if memtoy is then instructed to dirty all the pages > (using 'touch hole20G write'): > - first, memtoy does fast progress faulting through file pages, marking > them as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, memtoy does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, memtoy does no progress as it's at the dirty limit and nothing > gets written back. It gets stuck into a write fault that never > completes. > - i.e. this is essentially the same lockup as without my patch, except that > it occurs when the application tries to dirty the shared file rather than > during mlock itself. Seems to me that this bug is the first thing we should be looking at. > Using ext4, without my patch: > - first, mlock does fast progress going though file pages, marking them > as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, mlock does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, mlock returns an error. > > Using ext4, with my patch: > - mlock goes through all pages in ~5 seconds, marking them as mlocked > (but still not dirty) > - mlock completes. /data/hole20G still does not have blocks allocated. > - if memtoy is then instructed to dirty all the pages > (using 'touch hole20G write'): > - first, memtoy does fast progress faulting through file pages, marking > them as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, memtoy does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - at that point, memtoy dies of SIGBUS. > - i.e. for filesystems that define the page_mkwrite callback, the mlock > behavior when running out of space writing to sparse files is clearly > nicer without my patch than with it. > > > Not 100% sure what to make of these results. > > Approaching the problem the other way - would there be any objection to > adding code to do an fallocate() equivalent at the start of mlock ? > This would be a no-op when the file is fully allocated on disk, and would > allow mlock to return an error if the file can't get fully allocated > (no idea what errno should be for such case, though). Dirtying all that memory at mlock() time is pretty obnoxious. I'm inclined to agree that your patch implements the desirable behaviour: don't dirty the page, don't do block allocation. Take a fault at first-dirtying and do it then. This does degrade mlock a bit: the user will find that the first touch of an mlocked page can cause synchronous physical I/O, which isn't mlocky behaviour *at all*. But we have to be able to do this anyway - whenever the kupdate function writes back the dirty pages it has to mark them read-only again so the kernel knows when they get redirtied. I do agree that this will result in worse file layout for some reasonable userspace code patterns. But it was always that way, except for the eleven-release window where we kinda accidentally fixed that up in-kernel. Hopefully most apps which care are already ensuring that the file is well laid-out. So all that leaves me thinking that we merge your patches as-is. Then work out why users can fairly trivially use mlock to hang the kernel on ext2 and ext3 (and others?) ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-19 22:54 ` Andrew Morton 0 siblings, 0 replies; 56+ messages in thread From: Andrew Morton @ 2010-11-19 22:54 UTC (permalink / raw) To: Michel Lespinasse Cc: Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, 18 Nov 2010 23:23:16 -0800 Michel Lespinasse <walken@google.com> wrote: > On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote: > > On Thu, 18 Nov 2010, Christoph Hellwig wrote: > > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > > > Hence I think that avoiding ->page_mkwrite callouts is likely to > > > > break some filesystems in subtle, undetected ways. IMO, regardless > > > > of what is done, it would be really good to start by writing a new > > > > regression test to exercise and encode the expected the mlock > > > > behaviour so we can detect regressions later on.... > > > > > > I think it would help if we could drink a bit of the test driven design > > > coolaid here. Michel, can you write some testcases where pages on a > > > shared mapping are mlocked, then dirtied and then munlocked, and then > > > written out using msync/fsync. Anything that fails this test on > > > btrfs/ext4/gfs/xfs/etc obviously doesn't work. > > I think it's still under debate what's an acceptable result for this test > (i.e. what's supposed to happen during mlock of a shared mapping of > a sparsely allocated file - is a fallocate equivalent supposed to happen ?) > But I agree discussing based on test results will make things more concrete. > > > Whilst it's hard to argue against a request for testing, Dave's worries > > just sprang from a misunderstanding of all the talk about "avoiding -> > > page_mkwrite". There's nothing strange or risky about Michel's patch, > > it does not avoid ->page_mkwrite when there is a write: it just stops > > pretending that there was a write when locking down the shared area. > > So, I decided to test this using memtoy. Wait. You *tested* the kernel? I dunno, kids these days... > /data is a separate partition > where I had just 10GB free space, and /data/hole20G was created using > dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0. > > memtoy>file /data/hole20G shared > memtoy>map hole20G > > At this point the file is mapped using a writable, shared VMA. > > memtoy>touch hole20G > memtoy: touched 5242880 pages in 30.595 secs > > At this point memtoy's address space is populated with zeroed > pages. The pages are distinct (meminfo does show 20G of allocated pages), > are classified as file pages, not anon, and are associated with the > struct address_space for /data/hole20G. That file still does not have > blocks allocated, as can be seen with du /data/hole20G. > > memtoy>lock hole20G > > memtoy tries to mlock the hole20G VMA. > This is where things get interesting. > > Using ext2, without my patch (ext3 should be similar): > - first, mlock does fast progress going though file pages, marking them > as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, mlock does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, mlock does no progress as it's at the dirty limit and nothing > gets written back. > - mlock never terminates. > > Using ext2, with my patch (ext3 should be similar): > - mlock goes through all pages in ~5 seconds, marking them as mlocked > (but still not dirty) > - mlock completes. /data/hole20G still does not have blocks allocated. > - if memtoy is then instructed to dirty all the pages > (using 'touch hole20G write'): > - first, memtoy does fast progress faulting through file pages, marking > them as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, memtoy does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, memtoy does no progress as it's at the dirty limit and nothing > gets written back. It gets stuck into a write fault that never > completes. > - i.e. this is essentially the same lockup as without my patch, except that > it occurs when the application tries to dirty the shared file rather than > during mlock itself. Seems to me that this bug is the first thing we should be looking at. > Using ext4, without my patch: > - first, mlock does fast progress going though file pages, marking them > as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, mlock does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, mlock returns an error. > > Using ext4, with my patch: > - mlock goes through all pages in ~5 seconds, marking them as mlocked > (but still not dirty) > - mlock completes. /data/hole20G still does not have blocks allocated. > - if memtoy is then instructed to dirty all the pages > (using 'touch hole20G write'): > - first, memtoy does fast progress faulting through file pages, marking > them as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, memtoy does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - at that point, memtoy dies of SIGBUS. > - i.e. for filesystems that define the page_mkwrite callback, the mlock > behavior when running out of space writing to sparse files is clearly > nicer without my patch than with it. > > > Not 100% sure what to make of these results. > > Approaching the problem the other way - would there be any objection to > adding code to do an fallocate() equivalent at the start of mlock ? > This would be a no-op when the file is fully allocated on disk, and would > allow mlock to return an error if the file can't get fully allocated > (no idea what errno should be for such case, though). Dirtying all that memory at mlock() time is pretty obnoxious. I'm inclined to agree that your patch implements the desirable behaviour: don't dirty the page, don't do block allocation. Take a fault at first-dirtying and do it then. This does degrade mlock a bit: the user will find that the first touch of an mlocked page can cause synchronous physical I/O, which isn't mlocky behaviour *at all*. But we have to be able to do this anyway - whenever the kupdate function writes back the dirty pages it has to mark them read-only again so the kernel knows when they get redirtied. I do agree that this will result in worse file layout for some reasonable userspace code patterns. But it was always that way, except for the eleven-release window where we kinda accidentally fixed that up in-kernel. Hopefully most apps which care are already ensuring that the file is well laid-out. So all that leaves me thinking that we merge your patches as-is. Then work out why users can fairly trivially use mlock to hang the kernel on ext2 and ext3 (and others?) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-19 22:54 ` Andrew Morton @ 2010-11-19 23:22 ` Ted Ts'o -1 siblings, 0 replies; 56+ messages in thread From: Ted Ts'o @ 2010-11-19 23:22 UTC (permalink / raw) To: Andrew Morton Cc: Michel Lespinasse, Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal, Dustin Kirkland On Fri, Nov 19, 2010 at 02:54:42PM -0800, Andrew Morton wrote: > > Dirtying all that memory at mlock() time is pretty obnoxious. > ... > So all that leaves me thinking that we merge your patches as-is. Then > work out why users can fairly trivially use mlock to hang the kernel on > ext2 and ext3 (and others?) So at least on RHEL 4 and 5 systems, pam_limits was configured so that unprivileged processes could only mlock() at most 16k. This was deemed enough so that programs could protect crypto keys. The thinking when we added the mlock() ulimit setting was that unprivileged users could very easily make a nuisance of themselves, and grab way too much system resources, by using mlock() in obnoxious ways. I was just checking to see if my memory was correct, and to my surprise, I've just found that Ubuntu deliberately sets the memlock ulimit to be unlimited. Which means that Ubuntu systems are completely wide open for this particular DOS attack. So if you administer an Ubuntu-based server, it might be a good idea to make a tiny little change to /etc/security/limits.conf.... - Ted ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-19 23:22 ` Ted Ts'o 0 siblings, 0 replies; 56+ messages in thread From: Ted Ts'o @ 2010-11-19 23:22 UTC (permalink / raw) To: Andrew Morton Cc: Michel Lespinasse, Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal, Dustin Kirkland On Fri, Nov 19, 2010 at 02:54:42PM -0800, Andrew Morton wrote: > > Dirtying all that memory at mlock() time is pretty obnoxious. > ... > So all that leaves me thinking that we merge your patches as-is. Then > work out why users can fairly trivially use mlock to hang the kernel on > ext2 and ext3 (and others?) So at least on RHEL 4 and 5 systems, pam_limits was configured so that unprivileged processes could only mlock() at most 16k. This was deemed enough so that programs could protect crypto keys. The thinking when we added the mlock() ulimit setting was that unprivileged users could very easily make a nuisance of themselves, and grab way too much system resources, by using mlock() in obnoxious ways. I was just checking to see if my memory was correct, and to my surprise, I've just found that Ubuntu deliberately sets the memlock ulimit to be unlimited. Which means that Ubuntu systems are completely wide open for this particular DOS attack. So if you administer an Ubuntu-based server, it might be a good idea to make a tiny little change to /etc/security/limits.conf.... - Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-19 23:22 ` Ted Ts'o (?) @ 2010-11-20 0:29 ` Dustin Kirkland -1 siblings, 0 replies; 56+ messages in thread From: Dustin Kirkland @ 2010-11-20 0:29 UTC (permalink / raw) To: Ted Ts'o, kees.cook Cc: Andrew Morton, Michel Lespinasse, Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal [-- Attachment #1: Type: text/plain, Size: 1456 bytes --] On Fri, 2010-11-19 at 18:22 -0500, Ted Ts'o wrote: > On Fri, Nov 19, 2010 at 02:54:42PM -0800, Andrew Morton wrote: > > > > Dirtying all that memory at mlock() time is pretty obnoxious. > > ... > > So all that leaves me thinking that we merge your patches as-is. Then > > work out why users can fairly trivially use mlock to hang the kernel on > > ext2 and ext3 (and others?) > > So at least on RHEL 4 and 5 systems, pam_limits was configured so that > unprivileged processes could only mlock() at most 16k. This was > deemed enough so that programs could protect crypto keys. The > thinking when we added the mlock() ulimit setting was that > unprivileged users could very easily make a nuisance of themselves, > and grab way too much system resources, by using mlock() in obnoxious > ways. > > I was just checking to see if my memory was correct, and to my > surprise, I've just found that Ubuntu deliberately sets the memlock > ulimit to be unlimited. Which means that Ubuntu systems are > completely wide open for this particular DOS attack. So if you > administer an Ubuntu-based server, it might be a good idea to make a > tiny little change to /etc/security/limits.conf.... > > - Ted Kees, Copying you into this thread, in case you'd like to respond from the Ubuntu side. Thanks for the heads-up, Ted. -- :-Dustin Dustin Kirkland Canonical, LTD kirkland@canonical.com GPG: 1024D/83A61194 [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-19 22:54 ` Andrew Morton @ 2010-11-19 23:31 ` Michel Lespinasse -1 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-19 23:31 UTC (permalink / raw) To: Andrew Morton Cc: Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Fri, Nov 19, 2010 at 2:54 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 18 Nov 2010 23:23:16 -0800 > Michel Lespinasse <walken@google.com> wrote: > >> On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote: >> > On Thu, 18 Nov 2010, Christoph Hellwig wrote: >> > > I think it would help if we could drink a bit of the test driven design >> > > coolaid here. Michel, can you write some testcases where pages on a >> > > shared mapping are mlocked, then dirtied and then munlocked, and then >> > > written out using msync/fsync. Anything that fails this test on >> > > btrfs/ext4/gfs/xfs/etc obviously doesn't work. >> > Whilst it's hard to argue against a request for testing, Dave's worries >> > just sprang from a misunderstanding of all the talk about "avoiding -> >> > page_mkwrite". There's nothing strange or risky about Michel's patch, >> > it does not avoid ->page_mkwrite when there is a write: it just stops >> > pretending that there was a write when locking down the shared area. >> >> So, I decided to test this using memtoy. > > Wait. You *tested* the kernel? > > I dunno, kids these days... Not guilty - I mean, Christoph made me do it ! > Dirtying all that memory at mlock() time is pretty obnoxious. > > I'm inclined to agree that your patch implements the desirable > behaviour: don't dirty the page, don't do block allocation. Take a > fault at first-dirtying and do it then. This does degrade mlock a bit: > the user will find that the first touch of an mlocked page can cause > synchronous physical I/O, which isn't mlocky behaviour *at all*. But > we have to be able to do this anyway - whenever the kupdate function > writes back the dirty pages it has to mark them read-only again so the > kernel knows when they get redirtied. Glad to see that we seem to be coming to an agreement here. > So all that leaves me thinking that we merge your patches as-is. Then > work out why users can fairly trivially use mlock to hang the kernel on > ext2 and ext3 (and others?) I would say the hang is not even mlock related - you see without it also. All you need is mmap a large file with holes and write fault pages until you run out of disk space. At that point additional write faults wait for a writeback that can never complete. Sysadmin can however kill -9 such processes and/or free some space, though. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-19 23:31 ` Michel Lespinasse 0 siblings, 0 replies; 56+ messages in thread From: Michel Lespinasse @ 2010-11-19 23:31 UTC (permalink / raw) To: Andrew Morton Cc: Hugh Dickins, Christoph Hellwig, Dave Chinner, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Fri, Nov 19, 2010 at 2:54 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 18 Nov 2010 23:23:16 -0800 > Michel Lespinasse <walken@google.com> wrote: > >> On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote: >> > On Thu, 18 Nov 2010, Christoph Hellwig wrote: >> > > I think it would help if we could drink a bit of the test driven design >> > > coolaid here. Michel, can you write some testcases where pages on a >> > > shared mapping are mlocked, then dirtied and then munlocked, and then >> > > written out using msync/fsync. Anything that fails this test on >> > > btrfs/ext4/gfs/xfs/etc obviously doesn't work. >> > Whilst it's hard to argue against a request for testing, Dave's worries >> > just sprang from a misunderstanding of all the talk about "avoiding -> >> > page_mkwrite". There's nothing strange or risky about Michel's patch, >> > it does not avoid ->page_mkwrite when there is a write: it just stops >> > pretending that there was a write when locking down the shared area. >> >> So, I decided to test this using memtoy. > > Wait. You *tested* the kernel? > > I dunno, kids these days... Not guilty - I mean, Christoph made me do it ! > Dirtying all that memory at mlock() time is pretty obnoxious. > > I'm inclined to agree that your patch implements the desirable > behaviour: don't dirty the page, don't do block allocation. Take a > fault at first-dirtying and do it then. This does degrade mlock a bit: > the user will find that the first touch of an mlocked page can cause > synchronous physical I/O, which isn't mlocky behaviour *at all*. But > we have to be able to do this anyway - whenever the kupdate function > writes back the dirty pages it has to mark them read-only again so the > kernel knows when they get redirtied. Glad to see that we seem to be coming to an agreement here. > So all that leaves me thinking that we merge your patches as-is. Then > work out why users can fairly trivially use mlock to hang the kernel on > ext2 and ext3 (and others?) I would say the hang is not even mlock related - you see without it also. All you need is mmap a large file with holes and write fault pages until you run out of disk space. At that point additional write faults wait for a writeback that can never complete. Sysadmin can however kill -9 such processes and/or free some space, though. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-19 7:23 ` Michel Lespinasse @ 2010-11-19 23:54 ` Dave Chinner -1 siblings, 0 replies; 56+ messages in thread From: Dave Chinner @ 2010-11-19 23:54 UTC (permalink / raw) To: Michel Lespinasse Cc: Hugh Dickins, Christoph Hellwig, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 11:23:16PM -0800, Michel Lespinasse wrote: > On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote: > > On Thu, 18 Nov 2010, Christoph Hellwig wrote: > > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > > > Hence I think that avoiding ->page_mkwrite callouts is likely to > > > > break some filesystems in subtle, undetected ways. IMO, regardless > > > > of what is done, it would be really good to start by writing a new > > > > regression test to exercise and encode the expected the mlock > > > > behaviour so we can detect regressions later on.... > > > > > > I think it would help if we could drink a bit of the test driven design > > > coolaid here. Michel, can you write some testcases where pages on a > > > shared mapping are mlocked, then dirtied and then munlocked, and then > > > written out using msync/fsync. Anything that fails this test on > > > btrfs/ext4/gfs/xfs/etc obviously doesn't work. > > I think it's still under debate what's an acceptable result for this test > (i.e. what's supposed to happen during mlock of a shared mapping of > a sparsely allocated file - is a fallocate equivalent supposed to happen ?) > But I agree discussing based on test results will make things more concrete. > > > Whilst it's hard to argue against a request for testing, Dave's worries > > just sprang from a misunderstanding of all the talk about "avoiding -> > > page_mkwrite". There's nothing strange or risky about Michel's patch, > > it does not avoid ->page_mkwrite when there is a write: it just stops > > pretending that there was a write when locking down the shared area. > > So, I decided to test this using memtoy. /data is a separate partition > where I had just 10GB free space, and /data/hole20G was created using > dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0. > > memtoy>file /data/hole20G shared > memtoy>map hole20G > > At this point the file is mapped using a writable, shared VMA. > > memtoy>touch hole20G > memtoy: touched 5242880 pages in 30.595 secs > > At this point memtoy's address space is populated with zeroed > pages. The pages are distinct (meminfo does show 20G of allocated pages), > are classified as file pages, not anon, and are associated with the > struct address_space for /data/hole20G. That file still does not have > blocks allocated, as can be seen with du /data/hole20G. > > memtoy>lock hole20G > > memtoy tries to mlock the hole20G VMA. > This is where things get interesting. > > Using ext2, without my patch (ext3 should be similar): > - first, mlock does fast progress going though file pages, marking them > as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, mlock does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, mlock does no progress as it's at the dirty limit and nothing > gets written back. > - mlock never terminates. That's the typical result of ENOSPC during mmap writes into holes on filesystems that don't implement ->page_mkwrite() - this effectively overcommits the filesystem free space with more dirty pages than can fit in the free space, and then the system will get stuck on ENOSPC errors when trying to allocate in the writeback path, leaving you with dirty pages that can't be cleaned... > Using ext2, with my patch (ext3 should be similar): .... > - i.e. this is essentially the same lockup as without my patch, except that > it occurs when the application tries to dirty the shared file rather than > during mlock itself. Same thing - your patch just moves the trigger. > Using ext4, without my patch: > - first, mlock does fast progress going though file pages, marking them > as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, mlock does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, mlock returns an error. Yup, ext4 implements page_mkwrite(), so can detect ENOSPC during the write page fault. > Using ext4, with my patch: > - mlock goes through all pages in ~5 seconds, marking them as mlocked > (but still not dirty) > - mlock completes. /data/hole20G still does not have blocks allocated. > - if memtoy is then instructed to dirty all the pages > (using 'touch hole20G write'): > - first, memtoy does fast progress faulting through file pages, marking > them as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, memtoy does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - at that point, memtoy dies of SIGBUS. > - i.e. for filesystems that define the page_mkwrite callback, the mlock > behavior when running out of space writing to sparse files is clearly > nicer without my patch than with it. Which is how all filesystems _should_ behave. Any filesystem that does not implement ->page_mkwrite will break under this test, regardless of your patch. It is exercising the exact problem that page_mkwrite was introduced to solve.... > Not 100% sure what to make of these results. > > Approaching the problem the other way - would there be any objection to > adding code to do an fallocate() equivalent at the start of mlock ? IMO, you do not need fallocate or any form of preallocation at all during mlock(). As I've already pointed out, this still requires filesystems to implement ->page_mkwrite to ensure mmap writes into preallocated regions work correctly. Fundamentally, filesystems need to implement ->page-mkwrite() to do the right thing w.r.t. mmap writes and ENOSPC, and any filesystem that doesn't is plain broken. You don't need to try to fix this problem by making mlock() jump through hoops - it'll be fixed by people implementing a working ->page_mkwrite function for their filesystem. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-19 23:54 ` Dave Chinner 0 siblings, 0 replies; 56+ messages in thread From: Dave Chinner @ 2010-11-19 23:54 UTC (permalink / raw) To: Michel Lespinasse Cc: Hugh Dickins, Christoph Hellwig, Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 11:23:16PM -0800, Michel Lespinasse wrote: > On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote: > > On Thu, 18 Nov 2010, Christoph Hellwig wrote: > > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote: > > > > Hence I think that avoiding ->page_mkwrite callouts is likely to > > > > break some filesystems in subtle, undetected ways. IMO, regardless > > > > of what is done, it would be really good to start by writing a new > > > > regression test to exercise and encode the expected the mlock > > > > behaviour so we can detect regressions later on.... > > > > > > I think it would help if we could drink a bit of the test driven design > > > coolaid here. Michel, can you write some testcases where pages on a > > > shared mapping are mlocked, then dirtied and then munlocked, and then > > > written out using msync/fsync. Anything that fails this test on > > > btrfs/ext4/gfs/xfs/etc obviously doesn't work. > > I think it's still under debate what's an acceptable result for this test > (i.e. what's supposed to happen during mlock of a shared mapping of > a sparsely allocated file - is a fallocate equivalent supposed to happen ?) > But I agree discussing based on test results will make things more concrete. > > > Whilst it's hard to argue against a request for testing, Dave's worries > > just sprang from a misunderstanding of all the talk about "avoiding -> > > page_mkwrite". There's nothing strange or risky about Michel's patch, > > it does not avoid ->page_mkwrite when there is a write: it just stops > > pretending that there was a write when locking down the shared area. > > So, I decided to test this using memtoy. /data is a separate partition > where I had just 10GB free space, and /data/hole20G was created using > dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0. > > memtoy>file /data/hole20G shared > memtoy>map hole20G > > At this point the file is mapped using a writable, shared VMA. > > memtoy>touch hole20G > memtoy: touched 5242880 pages in 30.595 secs > > At this point memtoy's address space is populated with zeroed > pages. The pages are distinct (meminfo does show 20G of allocated pages), > are classified as file pages, not anon, and are associated with the > struct address_space for /data/hole20G. That file still does not have > blocks allocated, as can be seen with du /data/hole20G. > > memtoy>lock hole20G > > memtoy tries to mlock the hole20G VMA. > This is where things get interesting. > > Using ext2, without my patch (ext3 should be similar): > - first, mlock does fast progress going though file pages, marking them > as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, mlock does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, mlock does no progress as it's at the dirty limit and nothing > gets written back. > - mlock never terminates. That's the typical result of ENOSPC during mmap writes into holes on filesystems that don't implement ->page_mkwrite() - this effectively overcommits the filesystem free space with more dirty pages than can fit in the free space, and then the system will get stuck on ENOSPC errors when trying to allocate in the writeback path, leaving you with dirty pages that can't be cleaned... > Using ext2, with my patch (ext3 should be similar): .... > - i.e. this is essentially the same lockup as without my patch, except that > it occurs when the application tries to dirty the shared file rather than > during mlock itself. Same thing - your patch just moves the trigger. > Using ext4, without my patch: > - first, mlock does fast progress going though file pages, marking them > as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, mlock does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - then, mlock returns an error. Yup, ext4 implements page_mkwrite(), so can detect ENOSPC during the write page fault. > Using ext4, with my patch: > - mlock goes through all pages in ~5 seconds, marking them as mlocked > (but still not dirty) > - mlock completes. /data/hole20G still does not have blocks allocated. > - if memtoy is then instructed to dirty all the pages > (using 'touch hole20G write'): > - first, memtoy does fast progress faulting through file pages, marking > them as dirty. Eventually, it hits the dirty limit and gets throttled. > - then, memtoy does slow progress as it needs to wait for writeback. > writeback occurs and allocates blocks for the /data/hole20G. > Eventually, the /data partition gets full. > - at that point, memtoy dies of SIGBUS. > - i.e. for filesystems that define the page_mkwrite callback, the mlock > behavior when running out of space writing to sparse files is clearly > nicer without my patch than with it. Which is how all filesystems _should_ behave. Any filesystem that does not implement ->page_mkwrite will break under this test, regardless of your patch. It is exercising the exact problem that page_mkwrite was introduced to solve.... > Not 100% sure what to make of these results. > > Approaching the problem the other way - would there be any objection to > adding code to do an fallocate() equivalent at the start of mlock ? IMO, you do not need fallocate or any form of preallocation at all during mlock(). As I've already pointed out, this still requires filesystems to implement ->page_mkwrite to ensure mmap writes into preallocated regions work correctly. Fundamentally, filesystems need to implement ->page-mkwrite() to do the right thing w.r.t. mmap writes and ENOSPC, and any filesystem that doesn't is plain broken. You don't need to try to fix this problem by making mlock() jump through hoops - it'll be fixed by people implementing a working ->page_mkwrite function for their filesystem. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-17 15:28 ` Peter Zijlstra @ 2010-11-18 5:46 ` Nick Piggin -1 siblings, 0 replies; 56+ messages in thread From: Nick Piggin @ 2010-11-18 5:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 04:28:54PM +0100, Peter Zijlstra wrote: > On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote: > > On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: > > > When faulting in pages for mlock(), we want to break COW for anonymous > > > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is > > > no need to write-fault into VM_SHARED vmas since shared file pages can > > > be mlocked first and dirtied later, when/if they actually get written to. > > > Skipping the write fault is desirable, as we don't want to unnecessarily > > > cause these pages to be dirtied and queued for writeback. > > > > It's not just to break COW, but to do block allocation and such > > (filesystem's page_mkwrite op). That needs to at least be explained > > in the changelog. > > Agreed, the 0/3 description actually does mention this. Oh, missed that, but yes the changelog needs to. > > Filesystem doesn't have a good way to fully pin required things > > according to mlock, but page_mkwrite provides some reasonable things > > (like block allocation / reservation). > > Right, but marking all pages dirty isn't really sane. I can imagine > making the reservation but not marking things dirty solution, although > it might be lots harder to implement, esp since some filesystems don't > actually have a page_mkwrite() implementation. I think it is sane enough. Going back to previous behaviour would be a regression, wouldn't it? The right way to fix this would not be to introduce the new regression but either/both: a specific syscall to mlock-for-read which does not do any reservations, fix filesystem hook to allow reservation without implying dirtying. A simple flag to page_mkwrite will be enough (plus the logic to call it from VM). If an app has called mlock, presumably it doesn't want to SIGBUS from out of space, if it can possibly help it. If it isn't going to write to it, then PROT_READ would be appropriate. If not, then a note to man page maintainer, and an idea of performance improvement in an actual use case would be nice. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-18 5:46 ` Nick Piggin 0 siblings, 0 replies; 56+ messages in thread From: Nick Piggin @ 2010-11-18 5:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Wed, Nov 17, 2010 at 04:28:54PM +0100, Peter Zijlstra wrote: > On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote: > > On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote: > > > When faulting in pages for mlock(), we want to break COW for anonymous > > > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is > > > no need to write-fault into VM_SHARED vmas since shared file pages can > > > be mlocked first and dirtied later, when/if they actually get written to. > > > Skipping the write fault is desirable, as we don't want to unnecessarily > > > cause these pages to be dirtied and queued for writeback. > > > > It's not just to break COW, but to do block allocation and such > > (filesystem's page_mkwrite op). That needs to at least be explained > > in the changelog. > > Agreed, the 0/3 description actually does mention this. Oh, missed that, but yes the changelog needs to. > > Filesystem doesn't have a good way to fully pin required things > > according to mlock, but page_mkwrite provides some reasonable things > > (like block allocation / reservation). > > Right, but marking all pages dirty isn't really sane. I can imagine > making the reservation but not marking things dirty solution, although > it might be lots harder to implement, esp since some filesystems don't > actually have a page_mkwrite() implementation. I think it is sane enough. Going back to previous behaviour would be a regression, wouldn't it? The right way to fix this would not be to introduce the new regression but either/both: a specific syscall to mlock-for-read which does not do any reservations, fix filesystem hook to allow reservation without implying dirtying. A simple flag to page_mkwrite will be enough (plus the logic to call it from VM). If an app has called mlock, presumably it doesn't want to SIGBUS from out of space, if it can possibly help it. If it isn't going to write to it, then PROT_READ would be appropriate. If not, then a note to man page maintainer, and an idea of performance improvement in an actual use case would be nice. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-18 5:46 ` Nick Piggin @ 2010-11-18 10:43 ` Theodore Tso -1 siblings, 0 replies; 56+ messages in thread From: Theodore Tso @ 2010-11-18 10:43 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Nov 18, 2010, at 12:46 AM, Nick Piggin wrote: > The right way to fix this would not be to introduce the new regression > but either/both: a specific syscall to mlock-for-read which does not do > any reservations, fix filesystem hook to allow reservation without > implying dirtying. A simple flag to page_mkwrite will be enough (plus > the logic to call it from VM). Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that. Are there really programs that assume that mlock() == fallocate()?!? -- Ted ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-18 10:43 ` Theodore Tso 0 siblings, 0 replies; 56+ messages in thread From: Theodore Tso @ 2010-11-18 10:43 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Nov 18, 2010, at 12:46 AM, Nick Piggin wrote: > The right way to fix this would not be to introduce the new regression > but either/both: a specific syscall to mlock-for-read which does not do > any reservations, fix filesystem hook to allow reservation without > implying dirtying. A simple flag to page_mkwrite will be enough (plus > the logic to call it from VM). Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that. Are there really programs that assume that mlock() == fallocate()?!? -- Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-18 10:43 ` Theodore Tso @ 2010-11-18 13:39 ` Christoph Hellwig -1 siblings, 0 replies; 56+ messages in thread From: Christoph Hellwig @ 2010-11-18 13:39 UTC (permalink / raw) To: Theodore Tso Cc: Nick Piggin, Peter Zijlstra, Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 05:43:06AM -0500, Theodore Tso wrote: > Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that. > > Are there really programs that assume that mlock() == fallocate()?!? If there are programs that do they can't predate linux 2.6.15, and only work on btrfs/ext4/xfs/etc, but not ext2/ext3/reiserfs. Seems rather unlikely to me. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-18 13:39 ` Christoph Hellwig 0 siblings, 0 replies; 56+ messages in thread From: Christoph Hellwig @ 2010-11-18 13:39 UTC (permalink / raw) To: Theodore Tso Cc: Nick Piggin, Peter Zijlstra, Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, Nov 18, 2010 at 05:43:06AM -0500, Theodore Tso wrote: > Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that. > > Are there really programs that assume that mlock() == fallocate()?!? If there are programs that do they can't predate linux 2.6.15, and only work on btrfs/ext4/xfs/etc, but not ext2/ext3/reiserfs. Seems rather unlikely to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback 2010-11-18 13:39 ` Christoph Hellwig @ 2010-11-18 18:00 ` Hugh Dickins -1 siblings, 0 replies; 56+ messages in thread From: Hugh Dickins @ 2010-11-18 18:00 UTC (permalink / raw) To: Christoph Hellwig Cc: Theodore Tso, Nick Piggin, Peter Zijlstra, Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, 18 Nov 2010, Christoph Hellwig wrote: > On Thu, Nov 18, 2010 at 05:43:06AM -0500, Theodore Tso wrote: > > Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that. > > > > Are there really programs that assume that mlock() == fallocate()?!? > > If there are programs that do they can't predate linux 2.6.15, and only > work on btrfs/ext4/xfs/etc, but not ext2/ext3/reiserfs. Seems rather > unlikely to me. Yes, almost. I'm very much on this side, that mlocking should not dirty all those pages; but better admit one argument for the opposition - it's possible that we'd find a case somewhere, which has always (i.e. even pre- page_mkwrite) relied upon mlock of an entirely sparse file to result in a nicely ordered allocation of blocks to the file (as would often have happened with pdflush, I think), to give good sequential read patterns ever after; but with this patch would now get much more random block ordering, according to where the real writes actually fall. It would be possible for a filesystem's ->fault(vma, &vmf) to observe that it's being called on a VM_LOCKED|VM_SHARED vma, and make sure that the page has backing in that case, to reproduce the old allocation behaviour without all the unnecessary writing. But that would be extra work in every filesystem that cares. Hugh ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback @ 2010-11-18 18:00 ` Hugh Dickins 0 siblings, 0 replies; 56+ messages in thread From: Hugh Dickins @ 2010-11-18 18:00 UTC (permalink / raw) To: Christoph Hellwig Cc: Theodore Tso, Nick Piggin, Peter Zijlstra, Michel Lespinasse, linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Kosaki Motohiro, Theodore Tso, Michael Rubin, Suleiman Souhlal On Thu, 18 Nov 2010, Christoph Hellwig wrote: > On Thu, Nov 18, 2010 at 05:43:06AM -0500, Theodore Tso wrote: > > Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that. > > > > Are there really programs that assume that mlock() == fallocate()?!? > > If there are programs that do they can't predate linux 2.6.15, and only > work on btrfs/ext4/xfs/etc, but not ext2/ext3/reiserfs. Seems rather > unlikely to me. Yes, almost. I'm very much on this side, that mlocking should not dirty all those pages; but better admit one argument for the opposition - it's possible that we'd find a case somewhere, which has always (i.e. even pre- page_mkwrite) relied upon mlock of an entirely sparse file to result in a nicely ordered allocation of blocks to the file (as would often have happened with pdflush, I think), to give good sequential read patterns ever after; but with this patch would now get much more random block ordering, according to where the real writes actually fall. It would be possible for a filesystem's ->fault(vma, &vmf) to observe that it's being called on a VM_LOCKED|VM_SHARED vma, and make sure that the page has backing in that case, to reproduce the old allocation behaviour without all the unnecessary writing. But that would be extra work in every filesystem that cares. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2010-11-20 0:30 UTC | newest] Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2010-11-17 12:23 [PATCH 0/3] Avoid dirtying pages during mlock Michel Lespinasse 2010-11-17 12:23 ` Michel Lespinasse 2010-11-17 12:23 ` [PATCH 1/3] do_wp_page: remove the 'reuse' flag Michel Lespinasse 2010-11-17 12:23 ` Michel Lespinasse 2010-11-17 12:23 ` [PATCH 2/3] do_wp_page: clarify dirty_page handling Michel Lespinasse 2010-11-17 12:23 ` Michel Lespinasse 2010-11-17 12:23 ` [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback Michel Lespinasse 2010-11-17 12:23 ` Michel Lespinasse 2010-11-17 12:57 ` Nick Piggin 2010-11-17 12:57 ` Nick Piggin 2010-11-17 15:28 ` Peter Zijlstra 2010-11-17 15:28 ` Peter Zijlstra 2010-11-17 22:05 ` Michel Lespinasse 2010-11-17 22:05 ` Michel Lespinasse 2010-11-17 22:18 ` Peter Zijlstra 2010-11-17 22:18 ` Peter Zijlstra 2010-11-17 23:11 ` Dave Chinner 2010-11-17 23:11 ` Dave Chinner 2010-11-17 23:31 ` Michel Lespinasse 2010-11-17 23:31 ` Michel Lespinasse 2010-11-19 1:46 ` Dave Chinner 2010-11-19 1:46 ` Dave Chinner 2010-11-17 23:52 ` Ted Ts'o 2010-11-17 23:52 ` Ted Ts'o 2010-11-18 0:53 ` Andrew Morton 2010-11-18 0:53 ` Andrew Morton 2010-11-18 11:03 ` Michel Lespinasse 2010-11-18 11:03 ` Michel Lespinasse 2010-11-18 13:37 ` Christoph Hellwig 2010-11-18 13:37 ` Christoph Hellwig 2010-11-18 17:41 ` Hugh Dickins 2010-11-18 17:41 ` Hugh Dickins 2010-11-19 7:23 ` Michel Lespinasse 2010-11-19 7:23 ` Michel Lespinasse 2010-11-19 13:38 ` Theodore Tso 2010-11-19 13:42 ` Theodore Tso 2010-11-19 13:42 ` Theodore Tso 2010-11-19 15:06 ` Christoph Hellwig 2010-11-19 15:06 ` Christoph Hellwig 2010-11-19 22:54 ` Andrew Morton 2010-11-19 22:54 ` Andrew Morton 2010-11-19 23:22 ` Ted Ts'o 2010-11-19 23:22 ` Ted Ts'o 2010-11-20 0:29 ` Dustin Kirkland 2010-11-19 23:31 ` Michel Lespinasse 2010-11-19 23:31 ` Michel Lespinasse 2010-11-19 23:54 ` Dave Chinner 2010-11-19 23:54 ` Dave Chinner 2010-11-18 5:46 ` Nick Piggin 2010-11-18 5:46 ` Nick Piggin 2010-11-18 10:43 ` Theodore Tso 2010-11-18 10:43 ` Theodore Tso 2010-11-18 13:39 ` Christoph Hellwig 2010-11-18 13:39 ` Christoph Hellwig 2010-11-18 18:00 ` Hugh Dickins 2010-11-18 18:00 ` Hugh Dickins
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.