From: "Zi Yan"
Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move
Date: Mon, 29 Jan 2018 17:06:14 -0500
Message-ID: <8ECFD324-D8A0-47DC-A6FD-B9F7D29445DC@cs.rutgers.edu>
In-Reply-To: <20180103082555.14592-2-mhocko@kernel.org>
References: <20180103082555.14592-1-mhocko@kernel.org> <20180103082555.14592-2-mhocko@kernel.org>
To: Michal Hocko
Cc: Andrew Morton, Naoya Horiguchi, "Kirill A. Shutemov", Vlastimil Babka, Andrea Reale, Anshuman Khandual, linux-mm@kvack.org, LKML

Hi Michal,

I discovered that this patch does not hold mmap_sem while migrating pages
in do_move_pages_to_node(). A simple fix below moves the mmap_sem locking
from add_page_for_migration() to the outermost do_pages_move():

diff --git a/mm/migrate.c b/mm/migrate.c
index 5d0dc7b85f90..28b9e126cb38 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1487,7 +1487,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	unsigned int follflags;
 	int err;
 
-	down_read(&mm->mmap_sem);
 	err = -EFAULT;
 	vma = find_vma(mm, addr);
 	if (!vma || addr < vma->vm_start || !vma_migratable(vma))
@@ -1540,7 +1539,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	 */
 	put_page(page);
 out:
-	up_read(&mm->mmap_sem);
 	return err;
 }
 
@@ -1561,6 +1559,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 
 	migrate_prep();
 
+	down_read(&mm->mmap_sem);
 	for (i = start = 0; i < nr_pages; i++) {
 		const void __user *p;
 		unsigned long addr;
@@ -1628,6 +1627,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 	if (!err)
 		err = err1;
 out:
+	up_read(&mm->mmap_sem);
 	return err;
 }

-- 
Best Regards
Yan Zi
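
For reference, here is a minimal user-space sketch that exercises this
path through the move_pages(2) syscall. It is an illustration only, not
part of the patch; it assumes libnuma's <numaif.h> is available (build
with -lnuma) and that node 0 has memory:

#define _GNU_SOURCE
#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	enum { COUNT = 4 };
	void *pages[COUNT];
	int nodes[COUNT] = { 0, 0, 0, 0 };	/* target node per page */
	int status[COUNT];
	char *buf = aligned_alloc(page_size, COUNT * page_size);

	if (!buf)
		return 1;
	memset(buf, 1, COUNT * page_size);	/* fault the pages in */
	for (int i = 0; i < COUNT; i++)
		pages[i] = buf + i * page_size;

	/*
	 * pid == 0 means the calling process; on return each status[i]
	 * holds the page's resulting node or a negative error code.
	 */
	if (move_pages(0, COUNT, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	for (int i = 0; i < COUNT; i++)
		printf("page %d -> %d\n", i, status[i]);
	return 0;
}

Every move_pages() call lands in do_pages_move(), so a run like this
walks exactly the locking path the fix above changes.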
On 3 Jan 2018, at 3:25, Michal Hocko wrote:

> From: Michal Hocko
>
> do_pages_move is supposed to move user defined memory (an array of
> addresses) to the user defined numa nodes (an array of nodes, one for
> each address). The user provided status array then contains the
> resulting numa node for each address, or an error. The semantics of
> this function are a little bit confusing because only some errors are
> reported back. Notably the migrate_pages error is only reported via
> the return value. This patch doesn't try to address these semantic
> nuances but rather changes the underlying implementation.
>
> Currently we are processing user input (which can be really large) in
> batches which are stored to a temporarily allocated page. Each address
> is resolved to its struct page and stored into a page_to_node structure
> along with the requested target numa node. The array of these
> structures is then conveyed down the page migration path via the
> private argument. new_page_node then finds the corresponding structure
> and allocates the proper target page.
>
> What is the problem with the current implementation and why change it?
> Apart from being quite ugly, it also doesn't cope with unexpected pages
> showing up on the migration list inside the migrate_pages path. That
> doesn't happen currently, but the follow-up patch would like to make
> the thp migration code clearer, and that would need to split a THP into
> the list for some cases.
>
> How does the new implementation work? Well, instead of batching into a
> fixed-size array, we simply batch all pages that should be migrated to
> the same node and isolate all of them into a linked list, which doesn't
> require any additional storage. This should work reasonably well
> because page migration usually migrates larger ranges of memory to a
> specific node. So the common case should work as well as the current
> implementation. Even if somebody constructs an input where the target
> numa nodes would be interleaved, we shouldn't see a large performance
> impact because page migration alone doesn't really benefit from
> batching. mmap_sem batching for the lookup is quite questionable, and
> isolate_lru_page, which would benefit from batching, is not using it
> even in the current implementation.
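
The loop described in that last paragraph, condensed into a sketch (a
simplification of the diff below, with the user-copy, validation, and
error paths elided):

int current_node = NUMA_NO_NODE;
LIST_HEAD(pagelist);

for (i = start = 0; i < nr_pages; i++) {
	/* read pages[i] and nodes[i] from user space ... */

	if (current_node == NUMA_NO_NODE) {
		current_node = node;	/* open a new batch */
		start = i;
	} else if (node != current_node) {
		/*
		 * Target node changed: migrate the batch and report
		 * the resulting node for its pages via status[].
		 */
		do_move_pages_to_node(mm, &pagelist, current_node);
		store_status(status, start, current_node, i - start);
		current_node = node;
		start = i;
	}

	/*
	 * Isolate the page onto the shared list; lookup/isolation
	 * errors are only reported through the status array.
	 */
	add_page_for_migration(mm, addr, current_node, &pagelist,
			       flags & MPOL_MF_MOVE_ALL);
}

/* flush the trailing batch */
do_move_pages_to_node(mm, &pagelist, current_node);
store_status(status, start, current_node, i - start);

Flushing on every node change keeps the storage at a single list head
while preserving the batching win for runs of same-node requests.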
>
> Acked-by: Kirill A. Shutemov
> Signed-off-by: Michal Hocko
> ---
>  mm/internal.h  |   1 +
>  mm/mempolicy.c |   5 +-
>  mm/migrate.c   | 306 ++++++++++++++++++++++++------------------------
>  3 files changed, 138 insertions(+), 174 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 3e5dc95dc259..745e247aca9c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -540,4 +540,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
>  }
>
>  void setup_zone_pageset(struct zone *zone);
> +extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
>  #endif /* __MM_INTERNAL_H */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f604b22ebb65..66c9c79b21be 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -942,7 +942,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
>  	}
>  }
>
> -static struct page *new_node_page(struct page *page, unsigned long node, int **x)
> +/* page allocation callback for NUMA node migration */
> +struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
>  {
>  	if (PageHuge(page))
>  		return alloc_huge_page_node(page_hstate(compound_head(page)),
> @@ -986,7 +987,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
>  			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
>
>  	if (!list_empty(&pagelist)) {
> -		err = migrate_pages(&pagelist, new_node_page, NULL, dest,
> +		err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
>  					MIGRATE_SYNC, MR_SYSCALL);
>  		if (err)
>  			putback_movable_pages(&pagelist);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4d0be47a322a..8fb90bcd44a7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1444,141 +1444,103 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  }
>
>  #ifdef CONFIG_NUMA
> -/*
> - * Move a list of individual pages
> - */
> -struct page_to_node {
> -	unsigned long addr;
> -	struct page *page;
> -	int node;
> -	int status;
> -};
>
> -static struct page *new_page_node(struct page *p, unsigned long private,
> -		int **result)
> +static int store_status(int __user *status, int start, int value, int nr)
>  {
> -	struct page_to_node *pm = (struct page_to_node *)private;
> -
> -	while (pm->node != MAX_NUMNODES && pm->page != p)
> -		pm++;
> +	while (nr-- > 0) {
> +		if (put_user(value, status + start))
> +			return -EFAULT;
> +		start++;
> +	}
>
> -	if (pm->node == MAX_NUMNODES)
> -		return NULL;
> +	return 0;
> +}
>
> -	*result = &pm->status;
> +static int do_move_pages_to_node(struct mm_struct *mm,
> +		struct list_head *pagelist, int node)
> +{
> +	int err;
>
> -	if (PageHuge(p))
> -		return alloc_huge_page_node(page_hstate(compound_head(p)),
> -					pm->node);
> -	else if (thp_migration_supported() && PageTransHuge(p)) {
> -		struct page *thp;
> +	if (list_empty(pagelist))
> +		return 0;
>
> -		thp = alloc_pages_node(pm->node,
> -			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> -			HPAGE_PMD_ORDER);
> -		if (!thp)
> -			return NULL;
> -		prep_transhuge_page(thp);
> -		return thp;
> -	} else
> -		return __alloc_pages_node(pm->node,
> -				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
> +	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
> +			MIGRATE_SYNC, MR_SYSCALL);
> +	if (err)
> +		putback_movable_pages(pagelist);
> +	return err;
>  }
>
>  /*
> - * Move a set of pages as indicated in the pm array. The addr
> - * field must be set to the virtual address of the page to be moved
> - * and the node number must contain a valid target node.
> - * The pm array ends with node = MAX_NUMNODES.
> + * Resolves the given address to a struct page, isolates it from the LRU and
> + * puts it to the given pagelist.
> + * Returns -errno if the page cannot be found/isolated or 0 when it has been
> + * queued or the page doesn't need to be migrated because it is already on
> + * the target node
>   */
> -static int do_move_page_to_node_array(struct mm_struct *mm,
> -				      struct page_to_node *pm,
> -				      int migrate_all)
> +static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> +		int node, struct list_head *pagelist, bool migrate_all)
>  {
> +	struct vm_area_struct *vma;
> +	struct page *page;
> +	unsigned int follflags;
>  	int err;
> -	struct page_to_node *pp;
> -	LIST_HEAD(pagelist);
>
>  	down_read(&mm->mmap_sem);
> +	err = -EFAULT;
> +	vma = find_vma(mm, addr);
> +	if (!vma || addr < vma->vm_start || !vma_migratable(vma))
> +		goto out;
>
> -	/*
> -	 * Build a list of pages to migrate
> -	 */
> -	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
> -		struct vm_area_struct *vma;
> -		struct page *page;
> -		struct page *head;
> -		unsigned int follflags;
> -
> -		err = -EFAULT;
> -		vma = find_vma(mm, pp->addr);
> -		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
> -			goto set_status;
> -
> -		/* FOLL_DUMP to ignore special (like zero) pages */
> -		follflags = FOLL_GET | FOLL_DUMP;
> -		if (!thp_migration_supported())
> -			follflags |= FOLL_SPLIT;
> -		page = follow_page(vma, pp->addr, follflags);
> +	/* FOLL_DUMP to ignore special (like zero) pages */
> +	follflags = FOLL_GET | FOLL_DUMP;
> +	if (!thp_migration_supported())
> +		follflags |= FOLL_SPLIT;
> +	page = follow_page(vma, addr, follflags);
>
> -		err = PTR_ERR(page);
> -		if (IS_ERR(page))
> -			goto set_status;
> +	err = PTR_ERR(page);
> +	if (IS_ERR(page))
> +		goto out;
>
> -		err = -ENOENT;
> -		if (!page)
> -			goto set_status;
> +	err = -ENOENT;
> +	if (!page)
> +		goto out;
>
> -		err = page_to_nid(page);
> +	err = 0;
> +	if (page_to_nid(page) == node)
> +		goto out_putpage;
>
> -		if (err == pp->node)
> -			/*
> -			 * Node already in the right place
> -			 */
> -			goto put_and_set;
> +	err = -EACCES;
> +	if (page_mapcount(page) > 1 && !migrate_all)
> +		goto out_putpage;
>
> -		err = -EACCES;
> -		if (page_mapcount(page) > 1 &&
> -				!migrate_all)
> -			goto put_and_set;
> -
> -		if (PageHuge(page)) {
> -			if (PageHead(page)) {
> -				isolate_huge_page(page, &pagelist);
> -				err = 0;
> -				pp->page = page;
> -			}
> -			goto put_and_set;
> +	if (PageHuge(page)) {
> +		if (PageHead(page)) {
> +			isolate_huge_page(page, pagelist);
> +			err = 0;
>  		}
> +	} else {
> +		struct page *head;
>
> -		pp->page = compound_head(page);
>  		head = compound_head(page);
>  		err = isolate_lru_page(head);
> -		if (!err) {
> -			list_add_tail(&head->lru, &pagelist);
> -			mod_node_page_state(page_pgdat(head),
> -				NR_ISOLATED_ANON + page_is_file_cache(head),
> -				hpage_nr_pages(head));
> -		}
> -put_and_set:
> -		/*
> -		 * Either remove the duplicate refcount from
> -		 * isolate_lru_page() or drop the page ref if it was
> -		 * not isolated.
> -		 */
> -		put_page(page);
> -set_status:
> -		pp->status = err;
> -	}
> -
> -	err = 0;
> -	if (!list_empty(&pagelist)) {
> -		err = migrate_pages(&pagelist, new_page_node, NULL,
> -				(unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
>  		if (err)
> -			putback_movable_pages(&pagelist);
> -	}
> +			goto out_putpage;
>
> +		err = 0;
> +		list_add_tail(&head->lru, pagelist);
> +		mod_node_page_state(page_pgdat(head),
> +			NR_ISOLATED_ANON + page_is_file_cache(head),
> +			hpage_nr_pages(head));
> +	}
> +out_putpage:
> +	/*
> +	 * Either remove the duplicate refcount from
> +	 * isolate_lru_page() or drop the page ref if it was
> +	 * not isolated.
> +	 */
> +	put_page(page);
> +out:
>  	up_read(&mm->mmap_sem);
>  	return err;
>  }
> @@ -1593,79 +1555,79 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>  		const int __user *nodes,
>  		int __user *status, int flags)
>  {
> -	struct page_to_node *pm;
> -	unsigned long chunk_nr_pages;
> -	unsigned long chunk_start;
> -	int err;
> -
> -	err = -ENOMEM;
> -	pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
> -	if (!pm)
> -		goto out;
> +	int current_node = NUMA_NO_NODE;
> +	LIST_HEAD(pagelist);
> +	int start, i;
> +	int err = 0, err1;
>
>  	migrate_prep();
>
> -	/*
> -	 * Store a chunk of page_to_node array in a page,
> -	 * but keep the last one as a marker
> -	 */
> -	chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;
> -
> -	for (chunk_start = 0;
> -	     chunk_start < nr_pages;
> -	     chunk_start += chunk_nr_pages) {
> -		int j;
> +	for (i = start = 0; i < nr_pages; i++) {
> +		const void __user *p;
> +		unsigned long addr;
> +		int node;
>
> -		if (chunk_start + chunk_nr_pages > nr_pages)
> -			chunk_nr_pages = nr_pages - chunk_start;
> -
> -		/* fill the chunk pm with addrs and nodes from user-space */
> -		for (j = 0; j < chunk_nr_pages; j++) {
> -			const void __user *p;
> -			int node;
> -
> -			err = -EFAULT;
> -			if (get_user(p, pages + j + chunk_start))
> -				goto out_pm;
> -			pm[j].addr = (unsigned long) p;
> -
> -			if (get_user(node, nodes + j + chunk_start))
> -				goto out_pm;
> -
> -			err = -ENODEV;
> -			if (node < 0 || node >= MAX_NUMNODES)
> -				goto out_pm;
> -
> -			if (!node_state(node, N_MEMORY))
> -				goto out_pm;
> -
> -			err = -EACCES;
> -			if (!node_isset(node, task_nodes))
> -				goto out_pm;
> +		err = -EFAULT;
> +		if (get_user(p, pages + i))
> +			goto out_flush;
> +		if (get_user(node, nodes + i))
> +			goto out_flush;
> +		addr = (unsigned long)p;
> +
> +		err = -ENODEV;
> +		if (node < 0 || node >= MAX_NUMNODES)
> +			goto out_flush;
> +		if (!node_state(node, N_MEMORY))
> +			goto out_flush;
>
> -			pm[j].node = node;
> +		err = -EACCES;
> +		if (!node_isset(node, task_nodes))
> +			goto out_flush;
> +
> +		if (current_node == NUMA_NO_NODE) {
> +			current_node = node;
> +			start = i;
> +		} else if (node != current_node) {
> +			err = do_move_pages_to_node(mm, &pagelist, current_node);
> +			if (err)
> +				goto out;
> +			err = store_status(status, start, current_node, i - start);
> +			if (err)
> +				goto out;
> +			start = i;
> +			current_node = node;
>  		}
>
> -		/* End marker for this chunk */
> -		pm[chunk_nr_pages].node = MAX_NUMNODES;
> +		/*
> +		 * Errors in the page lookup or isolation are not fatal and we
> +		 * simply report them via status
> +		 */
> +		err = add_page_for_migration(mm, addr, current_node,
> +				&pagelist, flags & MPOL_MF_MOVE_ALL);
> +		if (!err)
> +			continue;
>
> -		/* Migrate this chunk */
> -		err = do_move_page_to_node_array(mm, pm,
> -						 flags & MPOL_MF_MOVE_ALL);
> -		if (err < 0)
> -			goto out_pm;
> +		err = store_status(status, i, err, 1);
> +		if (err)
> +			goto out_flush;
>
> -		/* Return status information */
> -		for (j = 0; j < chunk_nr_pages; j++)
> -			if (put_user(pm[j].status, status + j + chunk_start)) {
> -				err = -EFAULT;
> -				goto out_pm;
> -			}
> +		err = do_move_pages_to_node(mm, &pagelist, current_node);
> +		if (err)
> +			goto out;
> +		if (i > start) {
> +			err = store_status(status, start, current_node, i - start);
> +			if (err)
> +				goto out;
> +		}
> +		current_node = NUMA_NO_NODE;
>  	}
> -	err = 0;
> -
> -out_pm:
> -	free_page((unsigned long)pm);
> +out_flush:
> +	/* Make sure we do not overwrite the existing error */
> +	err1 = do_move_pages_to_node(mm, &pagelist, current_node);
> +	if (!err1)
> +		err1 = store_status(status, start, current_node, i - start);
> +	if (!err)
> +		err = err1;
>  out:
>  	return err;
>  }
> --
> 2.15.1
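
A practical footnote on the status semantics mentioned at the top of the
changelog: move_pages(2) callers have to consult both error channels,
the syscall return value and the per-page status array. A sketch
(report_move is a hypothetical helper name, not an existing kernel or
libnuma API):

#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>

/*
 * The syscall return value carries whole-call failures (for example a
 * migrate_pages() error), while status[] carries per-page lookup or
 * isolation errors such as -ENOENT or -EACCES.
 */
static void report_move(unsigned long count, void **pages,
			const int *nodes, int *status, int flags)
{
	long ret = move_pages(0, count, pages, nodes, status, flags);

	if (ret < 0) {
		perror("move_pages");	/* whole-call failure */
		return;
	}
	for (unsigned long i = 0; i < count; i++) {
		if (status[i] < 0)	/* per-page error */
			fprintf(stderr, "page %lu: error %d\n",
				i, status[i]);
		else
			printf("page %lu: on node %d\n", i, status[i]);
	}
}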