On Fri, 24 Mar 2023, Jiaqi Yan wrote: > On Fri, Mar 24, 2023 at 2:15 PM Yang Shi wrote: > > On Sat, Mar 4, 2023 at 10:51 PM Jiaqi Yan wrote: > > > > > > Make collapse_file roll back when copying pages failed. More concretely: > > > - extract copying operations into a separate loop > > > - postpone the updates for nr_none until both scanning and copying > > > succeeded > > > - postpone joining small xarray entries until both scanning and copying > > > succeeded > > > - postpone the update operations to NR_XXX_THPS until both scanning and > > > copying succeeded > > > - for non-SHMEM file, roll back filemap_nr_thps_inc if scan succeeded but > > > copying failed > > > > > > Tested manually: > > > 0. Enable khugepaged on system under test. Mount tmpfs at /mnt/ramdisk. > > > 1. Start a two-thread application. Each thread allocates a chunk of > > > non-huge memory buffer from /mnt/ramdisk. > > > 2. Pick 4 random buffer address (2 in each thread) and inject > > > uncorrectable memory errors at physical addresses. > > > 3. Signal both threads to make their memory buffer collapsible, i.e. > > > calling madvise(MADV_HUGEPAGE). > > > 4. Wait and then check kernel log: khugepaged is able to recover from > > > poisoned pages by skipping them. > > > 5. Signal both threads to inspect their buffer contents and make sure no > > > data corruption. > > > > > > Signed-off-by: Jiaqi Yan > > > > Reviewed-by: Yang Shi > > > > Just a nit below: Acked-by: Hugh Dickins with a little nit from me below, if you are respinning: > > > > > --- > > > mm/khugepaged.c | 78 ++++++++++++++++++++++++++++++------------------- > > > 1 file changed, 48 insertions(+), 30 deletions(-) > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > > index c3c217f6ebc6e..3ea2aa55c2c52 100644 > > > --- a/mm/khugepaged.c > > > +++ b/mm/khugepaged.c > > > @@ -1890,6 +1890,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > > { > > > struct address_space *mapping = file->f_mapping; > > > struct page *hpage; > > > + struct page *page; > > > + struct page *tmp; > > > + struct folio *folio; > > > pgoff_t index = 0, end = start + HPAGE_PMD_NR; > > > LIST_HEAD(pagelist); > > > XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER); > > > @@ -1934,8 +1937,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > > > > > xas_set(&xas, start); > > > for (index = start; index < end; index++) { > > > - struct page *page = xas_next(&xas); > > > - struct folio *folio; > > > + page = xas_next(&xas); > > > > > > VM_BUG_ON(index != xas.xa_index); > > > if (is_shmem) { > > > @@ -2117,10 +2119,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > > } > > > nr = thp_nr_pages(hpage); > > > > > > - if (is_shmem) > > > - __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr); > > > - else { > > > - __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr); > > > + if (!is_shmem) { > > > filemap_nr_thps_inc(mapping); > > > /* > > > * Paired with smp_mb() in do_dentry_open() to ensure That "nr = thp_nr_pages(hpage);" above becomes stranded a long way away from where "nr" is actually used for updating those statistics: please move it down with them. (I see "nr" is also reported in the tracepoint at the end, FWIW, so maybe that will show "0" in more failure cases than it used to, but that's okay - it has been decently initialized.) Thanks, Hugh