* Re: [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock
[not found] ` <20210604011844.1756145-4-ruansy.fnst@fujitsu.com>
@ 2021-06-16 0:48 ` Dan Williams
2021-06-17 6:51 ` ruansy.fnst
0 siblings, 1 reply; 7+ messages in thread
From: Dan Williams @ 2021-06-16 0:48 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Linux Kernel Mailing List, linux-xfs, Linux MM, linux-fsdevel,
device-mapper development, Darrick J. Wong, david,
Christoph Hellwig, Alasdair Kergon, Mike Snitzer,
Goldwyn Rodrigues, Linux NVDIMM
[ drop old linux-nvdimm@lists.01.org, add nvdimm@lists.linux.dev ]
On Thu, Jun 3, 2021 at 6:19 PM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
>
> Memory failure occurs in fsdax mode will finally be handled in
> filesystem. We introduce this interface to find out files or metadata
> affected by the corrupted range, and try to recover the corrupted data
> if possiable.
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> include/linux/fs.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c3c88fdb9b2a..92af36c4225f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2176,6 +2176,8 @@ struct super_operations {
> struct shrink_control *);
> long (*free_cached_objects)(struct super_block *,
> struct shrink_control *);
> + int (*corrupted_range)(struct super_block *sb, struct block_device *bdev,
> + loff_t offset, size_t len, void *data);
Why does the superblock need a new operation? Wouldn't whatever
function is specified here just be specified to the dax_dev as the
->notify_failure() holder callback?
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v4 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping
[not found] ` <20210604011844.1756145-5-ruansy.fnst@fujitsu.com>
@ 2021-06-16 6:30 ` Dan Williams
2021-06-17 7:51 ` ruansy.fnst
0 siblings, 1 reply; 7+ messages in thread
From: Dan Williams @ 2021-06-16 6:30 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Linux Kernel Mailing List, linux-xfs, Linux MM, linux-fsdevel,
device-mapper development, Darrick J. Wong, david,
Christoph Hellwig, Alasdair Kergon, Mike Snitzer,
Goldwyn Rodrigues, Linux NVDIMM
[ drop old nvdimm list, add the new one ]
On Thu, Jun 3, 2021 at 6:19 PM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
>
> The current memory_failure_dev_pagemap() can only handle single-mapped
> dax page for fsdax mode. The dax page could be mapped by multiple files
> and offsets if we let reflink feature & fsdax mode work together. So,
> we refactor current implementation to support handle memory failure on
> each file and offset.
I don't understand this organization, perhaps because this patch
introduces mf_dax_kill_procs() without a user. However, my expectation
is that memory_failure() is mostly untouched save for an early check
via pgmap->notify_memory_failure(). If pgmap->notify_memory_failure()
indicates that the memory failure was handled by the pgmap->owner /
dax_dev holder stack, then the typical memory failure path is
short-circuited. Otherwise, for non-reflink filesystems where
page->mapping() is valid the legacy / existing memory_failure()
operates as it does currently. If reflink capable filesystems want to
share a common implementation to map pfns to files they can, but I
don't think that common code belongs in mm/memory-failure.c.
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> fs/dax.c | 21 ++++++++
> include/linux/dax.h | 1 +
> include/linux/mm.h | 9 ++++
> mm/memory-failure.c | 114 ++++++++++++++++++++++++++++----------------
> 4 files changed, 105 insertions(+), 40 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 62352cbcf0f4..58faca85455a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -389,6 +389,27 @@ static struct page *dax_busy_page(void *entry)
> return NULL;
> }
>
> +/*
> + * dax_load_pfn - Load pfn of the DAX entry corresponding to a page
> + * @mapping: The file whose entry we want to load
> + * @index: The offset where the DAX entry located in
> + *
> + * Return: pfn of the DAX entry
> + */
> +unsigned long dax_load_pfn(struct address_space *mapping, unsigned long index)
> +{
> + XA_STATE(xas, &mapping->i_pages, index);
> + void *entry;
> + unsigned long pfn;
> +
> + xas_lock_irq(&xas);
> + entry = xas_load(&xas);
> + pfn = dax_to_pfn(entry);
> + xas_unlock_irq(&xas);
This looks racy, what happened to the locking afforded by dax_lock_page()?
> +
> + return pfn;
> +}
> +
> /*
> * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
> * @page: The page whose entry we want to lock
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 1ce343a960ab..6e758daa5004 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -158,6 +158,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>
> struct page *dax_layout_busy_page(struct address_space *mapping);
> struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
> +unsigned long dax_load_pfn(struct address_space *mapping, unsigned long index);
> dax_entry_t dax_lock_page(struct page *page);
> void dax_unlock_page(struct page *page, dax_entry_t cookie);
> #else
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c274f75efcf9..2b7527e93c77 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1187,6 +1187,14 @@ static inline bool is_device_private_page(const struct page *page)
> page->pgmap->type == MEMORY_DEVICE_PRIVATE;
> }
>
> +static inline bool is_device_fsdax_page(const struct page *page)
> +{
> + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> + IS_ENABLED(CONFIG_FS_DAX) &&
> + is_zone_device_page(page) &&
> + page->pgmap->type == MEMORY_DEVICE_FS_DAX;
Why is this necessary? The dax_dev holder is the one that knows the
nature of the pfn. The common memory_failure() code should not care
about fsdax vs devdax.
> +}
> +
> static inline bool is_pci_p2pdma_page(const struct page *page)
> {
> return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> @@ -3078,6 +3086,7 @@ enum mf_flags {
> MF_MUST_KILL = 1 << 2,
> MF_SOFT_OFFLINE = 1 << 3,
> };
> +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, int flags);
> extern int memory_failure(unsigned long pfn, int flags);
> extern void memory_failure_queue(unsigned long pfn, int flags);
> extern void memory_failure_queue_kick(int cpu);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 85ad98c00fd9..4377e727d478 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -56,6 +56,7 @@
> #include <linux/kfifo.h>
> #include <linux/ratelimit.h>
> #include <linux/page-isolation.h>
> +#include <linux/dax.h>
> #include "internal.h"
> #include "ras/ras_event.h"
>
> @@ -120,6 +121,13 @@ static int hwpoison_filter_dev(struct page *p)
> if (PageSlab(p))
> return -EINVAL;
>
> + if (pfn_valid(page_to_pfn(p))) {
> + if (is_device_fsdax_page(p))
This is racy unless the page is pinned. Also, not clear why this is needed?
> + return 0;
> + else
> + return -EINVAL;
> + }
> +
> mapping = page_mapping(p);
> if (mapping == NULL || mapping->host == NULL)
> return -EINVAL;
> @@ -290,10 +298,9 @@ void shake_page(struct page *p, int access)
> }
> EXPORT_SYMBOL_GPL(shake_page);
>
> -static unsigned long dev_pagemap_mapping_shift(struct page *page,
> - struct vm_area_struct *vma)
> +static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
> + unsigned long address)
> {
> - unsigned long address = vma_address(page, vma);
> pgd_t *pgd;
> p4d_t *p4d;
> pud_t *pud;
> @@ -333,9 +340,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
> * Schedule a process for later kill.
> * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> */
> -static void add_to_kill(struct task_struct *tsk, struct page *p,
> - struct vm_area_struct *vma,
> - struct list_head *to_kill)
> +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
> + struct vm_area_struct *vma, struct list_head *to_kill)
> {
> struct to_kill *tk;
>
> @@ -346,9 +352,12 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
> }
>
> tk->addr = page_address_in_vma(p, vma);
> - if (is_zone_device_page(p))
> - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> - else
> + if (is_zone_device_page(p)) {
> + if (is_device_fsdax_page(p))
> + tk->addr = vma->vm_start +
> + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> + tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
What was wrong with the original code?
> + } else
> tk->size_shift = page_shift(compound_head(p));
>
> /*
> @@ -496,7 +505,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> if (!page_mapped_in_vma(page, vma))
> continue;
> if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, 0, vma, to_kill);
> }
> }
> read_unlock(&tasklist_lock);
> @@ -506,24 +515,19 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> /*
> * Collect processes when the error hit a file mapped page.
> */
> -static void collect_procs_file(struct page *page, struct list_head *to_kill,
> - int force_early)
> +static void collect_procs_file(struct page *page, struct address_space *mapping,
> + pgoff_t pgoff, struct list_head *to_kill, int force_early)
> {
collect_procs() and kill_procs() are the only core memory_failure()
helpers I expect would be exported for a fileystem dax_dev holder to
call when it is trying to cleanup a memory_failure() on a reflink'd
mapping.
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> - struct address_space *mapping = page->mapping;
> - pgoff_t pgoff;
>
> i_mmap_lock_read(mapping);
> read_lock(&tasklist_lock);
> - pgoff = page_to_pgoff(page);
> for_each_process(tsk) {
> struct task_struct *t = task_early_kill(tsk, force_early);
> -
> if (!t)
> continue;
> - vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
> - pgoff) {
> + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> /*
> * Send early kill signal to tasks where a vma covers
> * the page but the corrupted page is not necessarily
> @@ -532,7 +536,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> * to be informed of all such data corruptions.
> */
> if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, pgoff, vma, to_kill);
> }
> }
> read_unlock(&tasklist_lock);
> @@ -551,7 +555,8 @@ static void collect_procs(struct page *page, struct list_head *tokill,
> if (PageAnon(page))
> collect_procs_anon(page, tokill, force_early);
> else
> - collect_procs_file(page, tokill, force_early);
> + collect_procs_file(page, page_mapping(page), page_to_pgoff(page),
> + tokill, force_early);
> }
>
> static const char *action_name[] = {
> @@ -1218,6 +1223,51 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
> return 0;
> }
>
> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> + struct address_space *mapping, pgoff_t index, int flags)
> +{
> + struct to_kill *tk;
> + unsigned long size = 0;
> + loff_t start;
> +
> + list_for_each_entry(tk, to_kill, nd)
> + if (tk->size_shift)
> + size = max(size, 1UL << tk->size_shift);
> + if (size) {
> + /*
> + * Unmap the largest mapping to avoid breaking up
> + * device-dax mappings which are constant size. The
> + * actual size of the mapping being torn down is
> + * communicated in siginfo, see kill_proc()
> + */
> + start = (index << PAGE_SHIFT) & ~(size - 1);
> + unmap_mapping_range(mapping, start, size, 0);
> + }
> +
> + kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
> +}
> +
> +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, int flags)
> +{
> + LIST_HEAD(to_kill);
> + /* load the pfn of the dax mapping file */
> + unsigned long pfn = dax_load_pfn(mapping, index);
> +
> + /*
> + * Unlike System-RAM there is no possibility to swap in a
> + * different physical page at a given virtual address, so all
> + * userspace consumption of ZONE_DEVICE memory necessitates
> + * SIGBUS (i.e. MF_MUST_KILL)
> + */
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> + collect_procs_file(pfn_to_page(pfn), mapping, index, &to_kill,
> + flags & MF_ACTION_REQUIRED);
> +
> + unmap_and_kill(&to_kill, pfn, mapping, index, flags);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> +
> static int memory_failure_hugetlb(unsigned long pfn, int flags)
> {
> struct page *p = pfn_to_page(pfn);
> @@ -1298,12 +1348,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> struct dev_pagemap *pgmap)
> {
> struct page *page = pfn_to_page(pfn);
> - const bool unmap_success = true;
> - unsigned long size = 0;
> - struct to_kill *tk;
> - LIST_HEAD(tokill);
> + LIST_HEAD(to_kill);
> int rc = -EBUSY;
> - loff_t start;
> dax_entry_t cookie;
>
> if (flags & MF_COUNT_INCREASED)
> @@ -1355,22 +1401,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> * SIGBUS (i.e. MF_MUST_KILL)
> */
> flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> - collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
> + collect_procs_file(page, page->mapping, page->index, &to_kill,
> + flags & MF_ACTION_REQUIRED);
>
> - list_for_each_entry(tk, &tokill, nd)
> - if (tk->size_shift)
> - size = max(size, 1UL << tk->size_shift);
> - if (size) {
> - /*
> - * Unmap the largest mapping to avoid breaking up
> - * device-dax mappings which are constant size. The
> - * actual size of the mapping being torn down is
> - * communicated in siginfo, see kill_proc()
> - */
> - start = (page->index << PAGE_SHIFT) & ~(size - 1);
> - unmap_mapping_range(page->mapping, start, size, 0);
> - }
> - kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
> + unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
There's just too much change in this patch and not enough
justification of what is being refactored and why.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v4 05/10] mm, pmem: Implement ->memory_failure() in pmem driver
[not found] ` <20210604011844.1756145-6-ruansy.fnst@fujitsu.com>
@ 2021-06-16 6:49 ` Dan Williams
0 siblings, 0 replies; 7+ messages in thread
From: Dan Williams @ 2021-06-16 6:49 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Linux Kernel Mailing List, linux-xfs, Linux MM, linux-fsdevel,
device-mapper development, Darrick J. Wong, david,
Christoph Hellwig, Alasdair Kergon, Mike Snitzer,
Goldwyn Rodrigues, Linux NVDIMM
On Thu, Jun 3, 2021 at 6:19 PM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
>
> Call the ->memory_failure() which is implemented by pmem driver, in
> order to finally notify filesystem to handle the corrupted data. The
> handler which collects and kills processes are moved into
> mf_dax_kill_procs(), which will be called by filesystem.
>
> Keep the old handler in order to roll back if driver or filesystem
> does not support ->memory_failure()/->corrupted_range().
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> block/genhd.c | 30 ++++++++++++++++++
> drivers/nvdimm/pmem.c | 14 +++++++++
> include/linux/genhd.h | 1 +
> mm/memory-failure.c | 71 +++++++++++++++++++++++++++----------------
I would not expect a patch that converts the pmem driver to touch
mm/memory-failure.c. mf_generic_kill_procs() has nothing to do with
the pmem driver.
> 4 files changed, 90 insertions(+), 26 deletions(-)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 9f8cb7beaad1..75834bd057df 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -718,6 +718,36 @@ struct block_device *bdget_disk(struct gendisk *disk, int partno)
> return bdev;
> }
>
> +/**
> + * bdget_disk_sector - get block device by given sector number
> + * @disk: gendisk of interest
> + * @sector: sector number
> + *
> + * RETURNS: the found block device where sector locates in
> + */
> +struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector)
> +{
> + struct block_device *part = NULL, *p;
> + unsigned long idx;
> +
> + rcu_read_lock();
> + xa_for_each(&disk->part_tbl, idx, p) {
> + if (p->bd_partno == 0)
> + continue;
> + if (p->bd_start_sect <= sector &&
> + sector < p->bd_start_sect + bdev_nr_sectors(p)) {
> + part = p;
> + break;
> + }
> + }
> + rcu_read_unlock();
> + if (!part)
> + part = disk->part0;
> +
> + return bdget_disk(disk, part->bd_partno);
> +}
> +EXPORT_SYMBOL(bdget_disk_sector);
I can't see any justification for this function. The pmem driver does
not need it and the upper layer holders should have their own
mechanism to go from dax_dev offset to bdev if they need to, but
hopefully they don't. A dax_device failure should be independent of a
block_device failure.
> +
> /*
> * print a full list of all partitions - intended for places where the root
> * filesystem can't be mounted and thus to give the victim some idea of what
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index ed10a8b66068..98349e7d0a28 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -364,9 +364,23 @@ static void pmem_release_disk(void *__pmem)
> put_disk(pmem->disk);
> }
>
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> + unsigned long pfn, int flags)
> +{
> + struct pmem_device *pdev =
> + container_of(pgmap, struct pmem_device, pgmap);
This local variable is called @pmem not @pdev in other parts of the driver.
> + loff_t offset = PFN_PHYS(pfn) - pdev->phys_addr - pdev->data_offset;
> + struct block_device *bdev =
> + bdget_disk_sector(pdev->disk, offset >> SECTOR_SHIFT);
> +
> + return dax_corrupted_range(pdev->dax_dev, bdev, offset,
> + page_size(pfn_to_page(pfn)), &flags);
Per previous comments this interface should be range based. Why is the
last argument &flags? That was passed in by value so any changes to it
will not be reflected back to memory_failure()?
> +}
> +
> static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> .kill = pmem_pagemap_kill,
> .cleanup = pmem_pagemap_cleanup,
> + .memory_failure = pmem_pagemap_memory_failure,
> };
>
> static int pmem_attach_disk(struct device *dev,
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 6fc26f7bdf71..2ad70c02c343 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -219,6 +219,7 @@ static inline void add_disk_no_queue_reg(struct gendisk *disk)
>
> extern void del_gendisk(struct gendisk *gp);
> extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
> +extern struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector);
>
> void set_disk_ro(struct gendisk *disk, bool read_only);
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 4377e727d478..43017d7f3918 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1247,6 +1247,36 @@ static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
> }
>
> +static int mf_generic_kill_procs(unsigned long long pfn, int flags)
> +{
> + struct page *page = pfn_to_page(pfn);
> + LIST_HEAD(to_kill);
> + dax_entry_t cookie;
> +
> + /*
> + * Prevent the inode from being freed while we are interrogating
> + * the address_space, typically this would be handled by
> + * lock_page(), but dax pages do not use the page lock. This
> + * also prevents changes to the mapping of this pfn until
> + * poison signaling is complete.
> + */
> + cookie = dax_lock_page(page);
> + if (!cookie)
> + return -EBUSY;
> + /*
> + * Unlike System-RAM there is no possibility to swap in a
> + * different physical page at a given virtual address, so all
> + * userspace consumption of ZONE_DEVICE memory necessitates
> + * SIGBUS (i.e. MF_MUST_KILL)
> + */
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> + collect_procs(page, &to_kill, flags & MF_ACTION_REQUIRED);
> +
> + unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
> + dax_unlock_page(page, cookie);
> + return 0;
> +}
> +
Per above, I am surprised to find this in this patch, it belong in its
own patch if at all.
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, int flags)
> {
> LIST_HEAD(to_kill);
> @@ -1348,9 +1378,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> struct dev_pagemap *pgmap)
> {
> struct page *page = pfn_to_page(pfn);
> - LIST_HEAD(to_kill);
> int rc = -EBUSY;
> - dax_entry_t cookie;
>
> if (flags & MF_COUNT_INCREASED)
> /*
> @@ -1364,20 +1392,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> goto out;
> }
>
> - /*
> - * Prevent the inode from being freed while we are interrogating
> - * the address_space, typically this would be handled by
> - * lock_page(), but dax pages do not use the page lock. This
> - * also prevents changes to the mapping of this pfn until
> - * poison signaling is complete.
> - */
> - cookie = dax_lock_page(page);
> - if (!cookie)
> - goto out;
> -
> if (hwpoison_filter(page)) {
> rc = 0;
> - goto unlock;
> + goto out;
> }
>
> if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> @@ -1385,7 +1402,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> * TODO: Handle HMM pages which may need coordination
> * with device-side memory.
> */
> - goto unlock;
> + goto out;
> }
>
> /*
> @@ -1395,19 +1412,21 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> SetPageHWPoison(page);
>
> /*
> - * Unlike System-RAM there is no possibility to swap in a
> - * different physical page at a given virtual address, so all
> - * userspace consumption of ZONE_DEVICE memory necessitates
> - * SIGBUS (i.e. MF_MUST_KILL)
> + * Call driver's implementation to handle the memory failure,
> + * otherwise roll back to generic handler.
> */
> - flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> - collect_procs_file(page, page->mapping, page->index, &to_kill,
> - flags & MF_ACTION_REQUIRED);
> + if (pgmap->ops->memory_failure) {
> + rc = pgmap->ops->memory_failure(pgmap, pfn, flags);
> + /*
> + * Roll back to generic handler too if operation is not
> + * supported inside the driver/device/filesystem.
> + */
> + if (rc != EOPNOTSUPP)
> + goto out;
> + }
> +
> + rc = mf_generic_kill_procs(pfn, flags);
>
> - unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
> - rc = 0;
> -unlock:
> - dax_unlock_page(page, cookie);
> out:
> /* drop pgmap ref acquired in caller */
> put_dev_pagemap(pgmap);
> --
> 2.31.1
>
>
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock
2021-06-16 0:48 ` [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock Dan Williams
@ 2021-06-17 6:51 ` ruansy.fnst
2021-06-17 7:04 ` Dan Williams
0 siblings, 1 reply; 7+ messages in thread
From: ruansy.fnst @ 2021-06-17 6:51 UTC (permalink / raw)
To: Dan Williams
Cc: Linux Kernel Mailing List, linux-xfs, Linux MM, linux-fsdevel,
device-mapper development, Darrick J. Wong, david,
Christoph Hellwig, Alasdair Kergon, Mike Snitzer,
Goldwyn Rodrigues, Linux NVDIMM
> -----Original Message-----
> From: Dan Williams <dan.j.williams@intel.com>
> Subject: Re: [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock
>
> [ drop old linux-nvdimm@lists.01.org, add nvdimm@lists.linux.dev ]
>
> On Thu, Jun 3, 2021 at 6:19 PM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
> >
> > Memory failure occurs in fsdax mode will finally be handled in
> > filesystem. We introduce this interface to find out files or metadata
> > affected by the corrupted range, and try to recover the corrupted data
> > if possiable.
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> > include/linux/fs.h | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h index
> > c3c88fdb9b2a..92af36c4225f 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2176,6 +2176,8 @@ struct super_operations {
> > struct shrink_control *);
> > long (*free_cached_objects)(struct super_block *,
> > struct shrink_control *);
> > + int (*corrupted_range)(struct super_block *sb, struct block_device
> *bdev,
> > + loff_t offset, size_t len, void *data);
>
> Why does the superblock need a new operation? Wouldn't whatever function is
> specified here just be specified to the dax_dev as the
> ->notify_failure() holder callback?
Because we need to find out which file is effected by the given poison page so that memory-failure code can do collect_procs() and kill_procs() jobs. And it needs filesystem to use its rmap feature to search the file from a given offset. So, we need this implemented by the specified filesystem and called by dax_device's holder.
This is the call trace I described in cover letter:
memory_failure()
* fsdax case
pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
dax_device->holder_ops->corrupted_range() =>
- fs_dax_corrupted_range()
- md_dax_corrupted_range()
sb->s_ops->currupted_range() => xfs_fs_corrupted_range() <== **HERE**
xfs_rmap_query_range()
xfs_currupt_helper()
* corrupted on metadata
try to recover data, call xfs_force_shutdown()
* corrupted on file data
try to recover data, call mf_dax_kill_procs()
* normal case
mf_generic_kill_procs()
As you can see, this new added operation is an important for the whole progress.
--
Thanks,
Ruan.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock
2021-06-17 6:51 ` ruansy.fnst
@ 2021-06-17 7:04 ` Dan Williams
2021-06-17 8:12 ` ruansy.fnst
0 siblings, 1 reply; 7+ messages in thread
From: Dan Williams @ 2021-06-17 7:04 UTC (permalink / raw)
To: ruansy.fnst
Cc: Linux Kernel Mailing List, linux-xfs, Linux MM, linux-fsdevel,
device-mapper development, Darrick J. Wong, david,
Christoph Hellwig, Alasdair Kergon, Mike Snitzer,
Goldwyn Rodrigues, Linux NVDIMM
On Wed, Jun 16, 2021 at 11:51 PM ruansy.fnst@fujitsu.com
<ruansy.fnst@fujitsu.com> wrote:
>
> > -----Original Message-----
> > From: Dan Williams <dan.j.williams@intel.com>
> > Subject: Re: [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock
> >
> > [ drop old linux-nvdimm@lists.01.org, add nvdimm@lists.linux.dev ]
> >
> > On Thu, Jun 3, 2021 at 6:19 PM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
> > >
> > > Memory failure occurs in fsdax mode will finally be handled in
> > > filesystem. We introduce this interface to find out files or metadata
> > > affected by the corrupted range, and try to recover the corrupted data
> > > if possiable.
> > >
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > ---
> > > include/linux/fs.h | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h index
> > > c3c88fdb9b2a..92af36c4225f 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2176,6 +2176,8 @@ struct super_operations {
> > > struct shrink_control *);
> > > long (*free_cached_objects)(struct super_block *,
> > > struct shrink_control *);
> > > + int (*corrupted_range)(struct super_block *sb, struct block_device
> > *bdev,
> > > + loff_t offset, size_t len, void *data);
> >
> > Why does the superblock need a new operation? Wouldn't whatever function is
> > specified here just be specified to the dax_dev as the
> > ->notify_failure() holder callback?
>
> Because we need to find out which file is effected by the given poison page so that memory-failure code can do collect_procs() and kill_procs() jobs. And it needs filesystem to use its rmap feature to search the file from a given offset. So, we need this implemented by the specified filesystem and called by dax_device's holder.
>
> This is the call trace I described in cover letter:
> memory_failure()
> * fsdax case
> pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
> dax_device->holder_ops->corrupted_range() =>
> - fs_dax_corrupted_range()
> - md_dax_corrupted_range()
> sb->s_ops->currupted_range() => xfs_fs_corrupted_range() <== **HERE**
> xfs_rmap_query_range()
> xfs_currupt_helper()
> * corrupted on metadata
> try to recover data, call xfs_force_shutdown()
> * corrupted on file data
> try to recover data, call mf_dax_kill_procs()
> * normal case
> mf_generic_kill_procs()
>
> As you can see, this new added operation is an important for the whole progress.
I don't think you need either fs_dax_corrupted_range() nor
sb->s_ops->corrupted_range(). In fact that fs_dax_corrupted_range()
looks broken because the filesystem may not even be mounted on the
device associated with the error. The holder_data and holder_op should
be sufficient from communicating the stack of notifications:
pgmap->notify_memory_failure() => pmem_pgmap_notify_failure()
pmem_dax_dev->holder_ops->notify_failure(pmem_dax_dev) =>
md_dax_notify_failure()
md_dax_dev->holder_ops->notify_failure() => xfs_notify_failure()
I.e. the entire chain just walks dax_dev holder ops.
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH v4 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping
2021-06-16 6:30 ` [PATCH v4 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping Dan Williams
@ 2021-06-17 7:51 ` ruansy.fnst
0 siblings, 0 replies; 7+ messages in thread
From: ruansy.fnst @ 2021-06-17 7:51 UTC (permalink / raw)
To: Dan Williams
Cc: Linux Kernel Mailing List, linux-xfs, Linux MM, linux-fsdevel,
device-mapper development, Darrick J. Wong, david,
Christoph Hellwig, Alasdair Kergon, Mike Snitzer,
Goldwyn Rodrigues, Linux NVDIMM
> -----Original Message-----
> From: Dan Williams <dan.j.williams@intel.com>
> Subject: Re: [PATCH v4 04/10] mm, fsdax: Refactor memory-failure handler for
> dax mapping
>
> [ drop old nvdimm list, add the new one ]
>
> On Thu, Jun 3, 2021 at 6:19 PM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
> >
> > The current memory_failure_dev_pagemap() can only handle single-mapped
> > dax page for fsdax mode. The dax page could be mapped by multiple
> > files and offsets if we let reflink feature & fsdax mode work
> > together. So, we refactor current implementation to support handle
> > memory failure on each file and offset.
>
> I don't understand this organization, perhaps because this patch introduces
> mf_dax_kill_procs() without a user.
Yes, I think I made it a mess... I should reorganize this whole patchset.
The mf_dax_kill_procs() is used by xfs in patch 9. I was mean to refactor these code for the next patches usage.
> However, my expectation is that
> memory_failure() is mostly untouched save for an early check via
> pgmap->notify_memory_failure(). If pgmap->notify_memory_failure() indicates
> that the memory failure was handled by the pgmap->owner / dax_dev holder
> stack, then the typical memory failure path is short-circuited. Otherwise, for
> non-reflink filesystems where
> page->mapping() is valid the legacy / existing memory_failure()
> operates as it does currently.
You can find this logic in patch 5.
When it comes to memory-failure() and memory_failure_dev_pagemap(), after some check, it will try to call pgmap->ops->memory_failure(). If this interface is implemented, for example pgmap is pmem device, it will call the dax_dev holder stack. And finally, it comes to mf_dax_kill_procs().
However, if something wrong happens in this stack, such as feature not support or interface not implemented, it will roll back to normal memory-failure hanlder which is refactored as mf_generic_kill_porcs().
So, I think we are in agreement on this.
Let me reorganize these code.
> If reflink capable filesystems want to share a
> common implementation to map pfns to files they can, but I don't think that
> common code belongs in mm/memory-failure.c.
>
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> > fs/dax.c | 21 ++++++++
> > include/linux/dax.h | 1 +
> > include/linux/mm.h | 9 ++++
> > mm/memory-failure.c | 114
> > ++++++++++++++++++++++++++++----------------
> > 4 files changed, 105 insertions(+), 40 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 62352cbcf0f4..58faca85455a 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -389,6 +389,27 @@ static struct page *dax_busy_page(void *entry)
> > return NULL;
> > }
> >
> > +/*
> > + * dax_load_pfn - Load pfn of the DAX entry corresponding to a page
> > + * @mapping: The file whose entry we want to load
> > + * @index: The offset where the DAX entry located in
> > + *
> > + * Return: pfn of the DAX entry
> > + */
> > +unsigned long dax_load_pfn(struct address_space *mapping, unsigned
> > +long index) {
> > + XA_STATE(xas, &mapping->i_pages, index);
> > + void *entry;
> > + unsigned long pfn;
> > +
> > + xas_lock_irq(&xas);
> > + entry = xas_load(&xas);
> > + pfn = dax_to_pfn(entry);
> > + xas_unlock_irq(&xas);
>
> This looks racy, what happened to the locking afforded by dax_lock_page()?
>
> > +
> > + return pfn;
> > +}
> > +
> > /*
> > * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
> > * @page: The page whose entry we want to lock diff --git
> > a/include/linux/dax.h b/include/linux/dax.h index
> > 1ce343a960ab..6e758daa5004 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -158,6 +158,7 @@ int dax_writeback_mapping_range(struct
> > address_space *mapping,
> >
> > struct page *dax_layout_busy_page(struct address_space *mapping);
> > struct page *dax_layout_busy_page_range(struct address_space *mapping,
> > loff_t start, loff_t end);
> > +unsigned long dax_load_pfn(struct address_space *mapping, unsigned
> > +long index);
> > dax_entry_t dax_lock_page(struct page *page); void
> > dax_unlock_page(struct page *page, dax_entry_t cookie); #else diff
> > --git a/include/linux/mm.h b/include/linux/mm.h index
> > c274f75efcf9..2b7527e93c77 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1187,6 +1187,14 @@ static inline bool is_device_private_page(const
> struct page *page)
> > page->pgmap->type == MEMORY_DEVICE_PRIVATE; }
> >
> > +static inline bool is_device_fsdax_page(const struct page *page) {
> > + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> > + IS_ENABLED(CONFIG_FS_DAX) &&
> > + is_zone_device_page(page) &&
> > + page->pgmap->type == MEMORY_DEVICE_FS_DAX;
>
> Why is this necessary? The dax_dev holder is the one that knows the nature of
> the pfn. The common memory_failure() code should not care about fsdax vs
> devdax.
add_to_kill() in collect_procs() needs this. Please see explanation at below.
>
> > +}
> > +
> > static inline bool is_pci_p2pdma_page(const struct page *page) {
> > return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && @@
> -3078,6
> > +3086,7 @@ enum mf_flags {
> > MF_MUST_KILL = 1 << 2,
> > MF_SOFT_OFFLINE = 1 << 3,
> > };
> > +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > +int flags);
> > extern int memory_failure(unsigned long pfn, int flags); extern void
> > memory_failure_queue(unsigned long pfn, int flags); extern void
> > memory_failure_queue_kick(int cpu); diff --git a/mm/memory-failure.c
> > b/mm/memory-failure.c index 85ad98c00fd9..4377e727d478 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -56,6 +56,7 @@
> > #include <linux/kfifo.h>
> > #include <linux/ratelimit.h>
> > #include <linux/page-isolation.h>
> > +#include <linux/dax.h>
> > #include "internal.h"
> > #include "ras/ras_event.h"
> >
> > @@ -120,6 +121,13 @@ static int hwpoison_filter_dev(struct page *p)
> > if (PageSlab(p))
> > return -EINVAL;
> >
> > + if (pfn_valid(page_to_pfn(p))) {
> > + if (is_device_fsdax_page(p))
>
> This is racy unless the page is pinned. Also, not clear why this is needed?
>
> > + return 0;
> > + else
> > + return -EINVAL;
> > + }
> > +
> > mapping = page_mapping(p);
> > if (mapping == NULL || mapping->host == NULL)
> > return -EINVAL;
> > @@ -290,10 +298,9 @@ void shake_page(struct page *p, int access) }
> > EXPORT_SYMBOL_GPL(shake_page);
> >
> > -static unsigned long dev_pagemap_mapping_shift(struct page *page,
> > - struct vm_area_struct *vma)
> > +static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct
> *vma,
> > + unsigned long
> address)
> > {
> > - unsigned long address = vma_address(page, vma);
> > pgd_t *pgd;
> > p4d_t *p4d;
> > pud_t *pud;
> > @@ -333,9 +340,8 @@ static unsigned long
> dev_pagemap_mapping_shift(struct page *page,
> > * Schedule a process for later kill.
> > * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> > */
> > -static void add_to_kill(struct task_struct *tsk, struct page *p,
> > - struct vm_area_struct *vma,
> > - struct list_head *to_kill)
> > +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
> > + struct vm_area_struct *vma, struct list_head
> > +*to_kill)
> > {
> > struct to_kill *tk;
> >
> > @@ -346,9 +352,12 @@ static void add_to_kill(struct task_struct *tsk, struct
> page *p,
> > }
> >
> > tk->addr = page_address_in_vma(p, vma);
> > - if (is_zone_device_page(p))
> > - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> > - else
> > + if (is_zone_device_page(p)) {
> > + if (is_device_fsdax_page(p))
> > + tk->addr = vma->vm_start +
> > + ((pgoff - vma->vm_pgoff) <<
> PAGE_SHIFT);
> > + tk->size_shift = dev_pagemap_mapping_shift(vma,
> > + tk->addr);
>
> What was wrong with the original code?
Here is the explanation: For normal page, it associate file's mapping and offset to page->mapping, page->index for rmap tracking. But for fsdax, in order to support reflink, we no longer use this mechanism, using dax_device holder stack instead. Thus, this dax page->mapping is NULL. As a result, we need is_device_fsdax_page(p) to distinguish if a page is a fsdax page and calculate this tk->addr manually.
>
> > + } else
> > tk->size_shift = page_shift(compound_head(p));
> >
> > /*
> > @@ -496,7 +505,7 @@ static void collect_procs_anon(struct page *page,
> struct list_head *to_kill,
> > if (!page_mapped_in_vma(page, vma))
> > continue;
> > if (vma->vm_mm == t->mm)
> > - add_to_kill(t, page, vma, to_kill);
> > + add_to_kill(t, page, 0, vma, to_kill);
> > }
> > }
> > read_unlock(&tasklist_lock);
> > @@ -506,24 +515,19 @@ static void collect_procs_anon(struct page
> > *page, struct list_head *to_kill,
> > /*
> > * Collect processes when the error hit a file mapped page.
> > */
> > -static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > - int force_early)
> > +static void collect_procs_file(struct page *page, struct address_space
> *mapping,
> > + pgoff_t pgoff, struct list_head *to_kill, int
> > +force_early)
> > {
>
> collect_procs() and kill_procs() are the only core memory_failure() helpers I
> expect would be exported for a fileystem dax_dev holder to call when it is trying
> to cleanup a memory_failure() on a reflink'd mapping.
Yes, they are the core we need. But there are some small different when dealing with normal page and dax page. So, I factor these two core functions into two helpers. One is mf_generic_kill_procs() for normal page. Another is mf_dax_kill_procs() for dax page.
>
> > struct vm_area_struct *vma;
> > struct task_struct *tsk;
> > - struct address_space *mapping = page->mapping;
> > - pgoff_t pgoff;
> >
> > i_mmap_lock_read(mapping);
> > read_lock(&tasklist_lock);
> > - pgoff = page_to_pgoff(page);
> > for_each_process(tsk) {
> > struct task_struct *t = task_early_kill(tsk,
> > force_early);
> > -
> > if (!t)
> > continue;
> > - vma_interval_tree_foreach(vma, &mapping->i_mmap,
> pgoff,
> > - pgoff) {
> > + vma_interval_tree_foreach(vma, &mapping->i_mmap,
> > + pgoff, pgoff) {
> > /*
> > * Send early kill signal to tasks where a vma
> covers
> > * the page but the corrupted page is not
> > necessarily @@ -532,7 +536,7 @@ static void collect_procs_file(struct page
> *page, struct list_head *to_kill,
> > * to be informed of all such data corruptions.
> > */
> > if (vma->vm_mm == t->mm)
> > - add_to_kill(t, page, vma, to_kill);
> > + add_to_kill(t, page, pgoff, vma,
> > + to_kill);
> > }
> > }
> > read_unlock(&tasklist_lock);
> > @@ -551,7 +555,8 @@ static void collect_procs(struct page *page, struct
> list_head *tokill,
> > if (PageAnon(page))
> > collect_procs_anon(page, tokill, force_early);
> > else
> > - collect_procs_file(page, tokill, force_early);
> > + collect_procs_file(page, page_mapping(page),
> page_to_pgoff(page),
> > + tokill, force_early);
> > }
> >
> > static const char *action_name[] = {
> > @@ -1218,6 +1223,51 @@ static int try_to_split_thp_page(struct page *page,
> const char *msg)
> > return 0;
> > }
> >
> > +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> > + struct address_space *mapping, pgoff_t index, int
> > +flags) {
> > + struct to_kill *tk;
> > + unsigned long size = 0;
> > + loff_t start;
> > +
> > + list_for_each_entry(tk, to_kill, nd)
> > + if (tk->size_shift)
> > + size = max(size, 1UL << tk->size_shift);
> > + if (size) {
> > + /*
> > + * Unmap the largest mapping to avoid breaking up
> > + * device-dax mappings which are constant size. The
> > + * actual size of the mapping being torn down is
> > + * communicated in siginfo, see kill_proc()
> > + */
> > + start = (index << PAGE_SHIFT) & ~(size - 1);
> > + unmap_mapping_range(mapping, start, size, 0);
> > + }
> > +
> > + kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
> > +}
> > +
> > +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > +int flags) {
> > + LIST_HEAD(to_kill);
> > + /* load the pfn of the dax mapping file */
> > + unsigned long pfn = dax_load_pfn(mapping, index);
> > +
> > + /*
> > + * Unlike System-RAM there is no possibility to swap in a
> > + * different physical page at a given virtual address, so all
> > + * userspace consumption of ZONE_DEVICE memory necessitates
> > + * SIGBUS (i.e. MF_MUST_KILL)
> > + */
> > + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > + collect_procs_file(pfn_to_page(pfn), mapping, index, &to_kill,
> > + flags & MF_ACTION_REQUIRED);
> > +
> > + unmap_and_kill(&to_kill, pfn, mapping, index, flags);
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> > +
> > static int memory_failure_hugetlb(unsigned long pfn, int flags) {
> > struct page *p = pfn_to_page(pfn); @@ -1298,12 +1348,8 @@
> > static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> > struct dev_pagemap *pgmap) {
> > struct page *page = pfn_to_page(pfn);
> > - const bool unmap_success = true;
> > - unsigned long size = 0;
> > - struct to_kill *tk;
> > - LIST_HEAD(tokill);
> > + LIST_HEAD(to_kill);
> > int rc = -EBUSY;
> > - loff_t start;
> > dax_entry_t cookie;
> >
> > if (flags & MF_COUNT_INCREASED) @@ -1355,22 +1401,10 @@
> static
> > int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> > * SIGBUS (i.e. MF_MUST_KILL)
> > */
> > flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > - collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
> > + collect_procs_file(page, page->mapping, page->index, &to_kill,
> > + flags & MF_ACTION_REQUIRED);
> >
> > - list_for_each_entry(tk, &tokill, nd)
> > - if (tk->size_shift)
> > - size = max(size, 1UL << tk->size_shift);
> > - if (size) {
> > - /*
> > - * Unmap the largest mapping to avoid breaking up
> > - * device-dax mappings which are constant size. The
> > - * actual size of the mapping being torn down is
> > - * communicated in siginfo, see kill_proc()
> > - */
> > - start = (page->index << PAGE_SHIFT) & ~(size - 1);
> > - unmap_mapping_range(page->mapping, start, size, 0);
> > - }
> > - kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn,
> flags);
> > + unmap_and_kill(&to_kill, pfn, page->mapping, page->index,
> > + flags);
>
> There's just too much change in this patch and not enough justification of what
> is being refactored and why.
Sorry again for the mess...
--
Thanks,
Ruan.
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock
2021-06-17 7:04 ` Dan Williams
@ 2021-06-17 8:12 ` ruansy.fnst
0 siblings, 0 replies; 7+ messages in thread
From: ruansy.fnst @ 2021-06-17 8:12 UTC (permalink / raw)
To: Dan Williams
Cc: Linux Kernel Mailing List, linux-xfs, Linux MM, linux-fsdevel,
device-mapper development, Darrick J. Wong, david,
Christoph Hellwig, Alasdair Kergon, Mike Snitzer,
Goldwyn Rodrigues, Linux NVDIMM
> -----Original Message-----
> From: Dan Williams <dan.j.williams@intel.com>
> Subject: Re: [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock
>
> On Wed, Jun 16, 2021 at 11:51 PM ruansy.fnst@fujitsu.com
> <ruansy.fnst@fujitsu.com> wrote:
> >
> > > -----Original Message-----
> > > From: Dan Williams <dan.j.williams@intel.com>
> > > Subject: Re: [PATCH v4 03/10] fs: Introduce ->corrupted_range() for
> > > superblock
> > >
> > > [ drop old linux-nvdimm@lists.01.org, add nvdimm@lists.linux.dev ]
> > >
> > > On Thu, Jun 3, 2021 at 6:19 PM Shiyang Ruan <ruansy.fnst@fujitsu.com>
> wrote:
> > > >
> > > > Memory failure occurs in fsdax mode will finally be handled in
> > > > filesystem. We introduce this interface to find out files or
> > > > metadata affected by the corrupted range, and try to recover the
> > > > corrupted data if possiable.
> > > >
> > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > ---
> > > > include/linux/fs.h | 2 ++
> > > > 1 file changed, 2 insertions(+)
> > > >
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h index
> > > > c3c88fdb9b2a..92af36c4225f 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -2176,6 +2176,8 @@ struct super_operations {
> > > > struct shrink_control *);
> > > > long (*free_cached_objects)(struct super_block *,
> > > > struct shrink_control *);
> > > > + int (*corrupted_range)(struct super_block *sb, struct
> > > > + block_device
> > > *bdev,
> > > > + loff_t offset, size_t len, void
> > > > + *data);
> > >
> > > Why does the superblock need a new operation? Wouldn't whatever
> > > function is specified here just be specified to the dax_dev as the
> > > ->notify_failure() holder callback?
> >
> > Because we need to find out which file is effected by the given poison page so
> that memory-failure code can do collect_procs() and kill_procs() jobs. And it
> needs filesystem to use its rmap feature to search the file from a given offset.
> So, we need this implemented by the specified filesystem and called by
> dax_device's holder.
> >
> > This is the call trace I described in cover letter:
> > memory_failure()
> > * fsdax case
> > pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
> > dax_device->holder_ops->corrupted_range() =>
> > - fs_dax_corrupted_range()
> > - md_dax_corrupted_range()
> > sb->s_ops->currupted_range() => xfs_fs_corrupted_range() <==
> **HERE**
> > xfs_rmap_query_range()
> > xfs_currupt_helper()
> > * corrupted on metadata
> > try to recover data, call xfs_force_shutdown()
> > * corrupted on file data
> > try to recover data, call mf_dax_kill_procs()
> > * normal case
> > mf_generic_kill_procs()
> >
> > As you can see, this new added operation is an important for the whole
> progress.
>
> I don't think you need either fs_dax_corrupted_range() nor
> sb->s_ops->corrupted_range(). In fact that fs_dax_corrupted_range()
> looks broken because the filesystem may not even be mounted on the device
> associated with the error.
If filesystem is not mounted, then there won't be any process using the broken page and no one need to be killed in memory-failure. So, I think we can just return and handle the error on driver level if needed.
> The holder_data and holder_op should be sufficient
> from communicating the stack of notifications:
>
> pgmap->notify_memory_failure() => pmem_pgmap_notify_failure()
> pmem_dax_dev->holder_ops->notify_failure(pmem_dax_dev) =>
> md_dax_notify_failure()
> md_dax_dev->holder_ops->notify_failure() => xfs_notify_failure()
>
> I.e. the entire chain just walks dax_dev holder ops.
Oh, I see. Just need to implement holder_ops in filesystem or mapped_device directly. I made the routine complicated.
--
Thanks,
Ruan.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-06-17 8:14 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20210604011844.1756145-1-ruansy.fnst@fujitsu.com>
[not found] ` <20210604011844.1756145-4-ruansy.fnst@fujitsu.com>
2021-06-16 0:48 ` [PATCH v4 03/10] fs: Introduce ->corrupted_range() for superblock Dan Williams
2021-06-17 6:51 ` ruansy.fnst
2021-06-17 7:04 ` Dan Williams
2021-06-17 8:12 ` ruansy.fnst
[not found] ` <20210604011844.1756145-5-ruansy.fnst@fujitsu.com>
2021-06-16 6:30 ` [PATCH v4 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping Dan Williams
2021-06-17 7:51 ` ruansy.fnst
[not found] ` <20210604011844.1756145-6-ruansy.fnst@fujitsu.com>
2021-06-16 6:49 ` [PATCH v4 05/10] mm, pmem: Implement ->memory_failure() in pmem driver Dan Williams
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).