* Re: [kvm PATCH 1/2] mm: export __vmalloc_node_range()
       [not found] ` <20181020211200.255171-2-marcorr@google.com>
@ 2018-10-22 20:06   ` Konrad Rzeszutek Wilk
  2018-10-23 12:33     ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-10-22 20:06 UTC (permalink / raw)
  To: Marc Orr, linux-mm, akpm; +Cc: kvm, jmattson, rientjes

On Sat, Oct 20, 2018 at 02:11:59PM -0700, Marc Orr wrote:
> __vmalloc_node_range() is declared in include/linux/vmalloc.h, but it's
> not exported, so it can't be used from modules. This patch exports the
> API. The motivation is to do aligned vmalloc allocations of KVM vcpus.

Would it make more sense to change it to not have __ in front of it?
Also you forgot to CC the linux-mm folks. Doing that for you.

> 
> Signed-off-by: Marc Orr <marcorr@google.com>
> ---
>  mm/vmalloc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a728fc492557..9e7974ab1da4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1763,6 +1763,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  			  "vmalloc: allocation failure: %lu bytes", real_size);
>  	return NULL;
>  }
> +EXPORT_SYMBOL_GPL(__vmalloc_node_range);
>  
>  /**
>   *	__vmalloc_node  -  allocate virtually contiguous memory
> -- 
> 2.19.1.568.g152ad8e336-goog
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [kvm PATCH 1/2] mm: export __vmalloc_node_range()
  2018-10-22 20:06   ` [kvm PATCH 1/2] mm: export __vmalloc_node_range() Konrad Rzeszutek Wilk
@ 2018-10-23 12:33     ` Michal Hocko
  2018-10-23 21:10       ` Marc Orr
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2018-10-23 12:33 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Marc Orr, linux-mm, akpm, kvm, jmattson, rientjes

On Mon 22-10-18 16:06:17, Konrad Rzeszutek Wilk wrote:
> On Sat, Oct 20, 2018 at 02:11:59PM -0700, Marc Orr wrote:
> > __vmalloc_node_range() is declared in include/linux/vmalloc.h, but it's
> > not exported, so it can't be used from modules. This patch exports the
> > API. The motivation is to do aligned vmalloc allocations of KVM vcpus.
> 
> Would it make more sense to change it to not have __ in front of it?
> Also you forgot to CC the linux-mm folks. Doing that for you.

Please also add a user so that we can see how the symbol is actually
used, with a short explanation of why the existing API is not suitable.

> > 
> > Signed-off-by: Marc Orr <marcorr@google.com>
> > ---
> >  mm/vmalloc.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index a728fc492557..9e7974ab1da4 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1763,6 +1763,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >  			  "vmalloc: allocation failure: %lu bytes", real_size);
> >  	return NULL;
> >  }
> > +EXPORT_SYMBOL_GPL(__vmalloc_node_range);
> >  
> >  /**
> >   *	__vmalloc_node  -  allocate virtually contiguous memory
> > -- 
> > 2.19.1.568.g152ad8e336-goog
> > 

-- 
Michal Hocko
SUSE Labs


* Re: [kvm PATCH 1/2] mm: export __vmalloc_node_range()
  2018-10-23 12:33     ` Michal Hocko
@ 2018-10-23 21:10       ` Marc Orr
  2018-10-24  6:16         ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Orr @ 2018-10-23 21:10 UTC (permalink / raw)
  To: mhocko; +Cc: konrad.wilk, linux-mm, akpm, kvm, Jim Mattson, David Rientjes

Ack. The user is the 2nd patch in this series, the kvm_intel module,
which uses this version of vmalloc() to allocate vcpus across
non-contiguous memory. I will cc everyone here on that 2nd patch for
context.
Thanks,
Marc

On Tue, Oct 23, 2018 at 8:33 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 22-10-18 16:06:17, Konrad Rzeszutek Wilk wrote:
> > On Sat, Oct 20, 2018 at 02:11:59PM -0700, Marc Orr wrote:
> > > __vmalloc_node_range() is declared in include/linux/vmalloc.h, but it's
> > > not exported, so it can't be used from modules. This patch exports the
> > > API. The motivation is to do aligned vmalloc allocations of KVM vcpus.
> >
> > Would it make more sense to change it to not have __ in front of it?
> > Also you forgot to CC the linux-mm folks. Doing that for you.
>
> Please also add a user so that we can see how the symbol is actually
> used, with a short explanation of why the existing API is not suitable.
>
> > >
> > > Signed-off-by: Marc Orr <marcorr@google.com>
> > > ---
> > >  mm/vmalloc.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index a728fc492557..9e7974ab1da4 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -1763,6 +1763,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > >                       "vmalloc: allocation failure: %lu bytes", real_size);
> > >     return NULL;
> > >  }
> > > +EXPORT_SYMBOL_GPL(__vmalloc_node_range);
> > >
> > >  /**
> > >   * __vmalloc_node  -  allocate virtually contiguous memory
> > > --
> > > 2.19.1.568.g152ad8e336-goog
> > >
>
> --
> Michal Hocko
> SUSE Labs


* Re: [kvm PATCH 0/2] kvm: vmalloc vmx vcpus
       [not found] <20181020211200.255171-1-marcorr@google.com>
       [not found] ` <20181020211200.255171-2-marcorr@google.com>
@ 2018-10-23 21:13 ` Marc Orr
       [not found] ` <20181020211200.255171-3-marcorr@google.com>
  2 siblings, 0 replies; 12+ messages in thread
From: Marc Orr @ 2018-10-23 21:13 UTC (permalink / raw)
  To: kvm, Jim Mattson, David Rientjes, konrad.wilk, linux-mm, akpm

Adding everyone that's cc'd on [kvm PATCH 1/2] mm: export
__vmalloc_node_range().
Thanks,
Marc

On Sat, Oct 20, 2018 at 5:12 PM Marc Orr <marcorr@google.com> wrote:
>
> Patch series to allocate vmx vcpus with vmalloc() instead of kmalloc().
> This enables vendors to pack more VMs on a single host.
>
> Marc Orr (2):
>   mm: export __vmalloc_node_range()
>   kvm: vmx: use vmalloc() to allocate vcpus
>
>  arch/x86/kvm/vmx.c  | 98 +++++++++++++++++++++++++++++++++++++++++----
>  mm/vmalloc.c        |  1 +
>  virt/kvm/kvm_main.c | 28 +++++++------
>  3 files changed, 107 insertions(+), 20 deletions(-)
>
> --
> 2.19.1.568.g152ad8e336-goog
>


* Re: [kvm PATCH 2/2] kvm: vmx: use vmalloc() to allocate vcpus
       [not found] ` <20181020211200.255171-3-marcorr@google.com>
@ 2018-10-23 21:13   ` Marc Orr
  2018-10-24 11:41     ` Matthew Wilcox
  2018-10-24 22:31     ` Sean Christopherson
  0 siblings, 2 replies; 12+ messages in thread
From: Marc Orr @ 2018-10-23 21:13 UTC (permalink / raw)
  To: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk, linux-mm, akpm

Adding everyone that's cc'd on [kvm PATCH 1/2] mm: export
__vmalloc_node_range().
Thanks,
Marc
On Sat, Oct 20, 2018 at 5:13 PM Marc Orr <marcorr@google.com> wrote:
>
> Previously, vcpus were allocated through the kmem_cache_zalloc() API,
> which requires the underlying physical memory to be contiguous.
> Because the x86 vcpu struct, struct vcpu_vmx, is relatively large
> (e.g., currently 47680 bytes on my setup), it can become hard to find
> contiguous memory.
>
> At the same time, the comments in the code indicate that the primary
> reason for using the kmem_cache_zalloc() API is to align the memory
> rather than to provide physical contiguity.
>
> Thus, this patch updates the vcpu allocation logic for vmx to use the
> vmalloc() API.
>
> Signed-off-by: Marc Orr <marcorr@google.com>
> ---
>  arch/x86/kvm/vmx.c  | 98 +++++++++++++++++++++++++++++++++++++++++----
>  virt/kvm/kvm_main.c | 28 +++++++------
>  2 files changed, 106 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index abeeb45d1c33..d480a2cc0667 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -898,7 +898,14 @@ struct nested_vmx {
>  #define POSTED_INTR_ON  0
>  #define POSTED_INTR_SN  1
>
> -/* Posted-Interrupt Descriptor */
> +/*
> + * Posted-Interrupt Descriptor
> + *
> + * Note, the physical address of this structure is used by VMX. Furthermore, the
> + * translation code assumes that the entire pi_desc struct resides within a
> + * single page, which will be true because the struct is 64 bytes and 64-byte
> + * aligned.
> + */
>  struct pi_desc {
>         u32 pir[8];     /* Posted interrupt requested */
>         union {
> @@ -970,8 +977,25 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)
>
>  struct vmx_msrs {
>         unsigned int            nr;
> -       struct vmx_msr_entry    val[NR_AUTOLOAD_MSRS];
> +       struct vmx_msr_entry    *val;
>  };
> +struct kmem_cache *vmx_msr_entry_cache;
> +
> +/*
> + * To prevent vmx_msr_entry array from crossing a page boundary, require:
> + * sizeof(*vmx_msrs.vmx_msr_entry.val) to be a power of two. This is guaranteed
> + * through compile-time asserts that:
> + *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) is a power of two
> + *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) <= PAGE_SIZE
> + *   - The allocation of vmx_msrs.vmx_msr_entry.val is aligned to its size.
> + */
> +#define CHECK_POWER_OF_TWO(val) \
> +       BUILD_BUG_ON_MSG(!((val) && !((val) & ((val) - 1))), \
> +       #val " is not a power of two.")
> +#define CHECK_INTRA_PAGE(val) do { \
> +               CHECK_POWER_OF_TWO(val); \
> +               BUILD_BUG_ON(!(val <= PAGE_SIZE)); \
> +       } while (0)
>
>  struct vcpu_vmx {
>         struct kvm_vcpu       vcpu;
> @@ -6616,6 +6640,14 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
>         }
>
>         if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
> +               /*
> +                * Note, pi_desc is contained within a single
> +                * page because the struct is 64 bytes and 64-byte aligned.
> +                */
> +               phys_addr_t pi_desc_phys =
> +                       page_to_phys(vmalloc_to_page(&vmx->pi_desc)) +
> +                       (u64)&vmx->pi_desc % PAGE_SIZE;
> +
>                 vmcs_write64(EOI_EXIT_BITMAP0, 0);
>                 vmcs_write64(EOI_EXIT_BITMAP1, 0);
>                 vmcs_write64(EOI_EXIT_BITMAP2, 0);
> @@ -6624,7 +6656,7 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
>                 vmcs_write16(GUEST_INTR_STATUS, 0);
>
>                 vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
> -               vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +               vmcs_write64(POSTED_INTR_DESC_ADDR, pi_desc_phys);
>         }
>
>         if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
> @@ -11476,19 +11508,43 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
>         free_loaded_vmcs(vmx->loaded_vmcs);
>         kfree(vmx->guest_msrs);
>         kvm_vcpu_uninit(vcpu);
> -       kmem_cache_free(kvm_vcpu_cache, vmx);
> +       kmem_cache_free(vmx_msr_entry_cache, vmx->msr_autoload.guest.val);
> +       kmem_cache_free(vmx_msr_entry_cache, vmx->msr_autoload.host.val);
> +       vfree(vmx);
>  }
>
>  static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
>  {
>         int err;
> -       struct vcpu_vmx *vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
> +       struct vcpu_vmx *vmx = __vmalloc_node_range(
> +                       sizeof(struct vcpu_vmx),
> +                       __alignof__(struct vcpu_vmx),
> +                       VMALLOC_START,
> +                       VMALLOC_END,
> +                       GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO | __GFP_ACCOUNT,
> +                       PAGE_KERNEL,
> +                       0,
> +                       NUMA_NO_NODE,
> +                       __builtin_return_address(0));
>         unsigned long *msr_bitmap;
>         int cpu;
>
>         if (!vmx)
>                 return ERR_PTR(-ENOMEM);
>
> +       vmx->msr_autoload.guest.val =
> +               kmem_cache_zalloc(vmx_msr_entry_cache, GFP_KERNEL);
> +       if (!vmx->msr_autoload.guest.val) {
> +               err = -ENOMEM;
> +               goto free_vmx;
> +       }
> +       vmx->msr_autoload.host.val =
> +               kmem_cache_zalloc(vmx_msr_entry_cache, GFP_KERNEL);
> +       if (!vmx->msr_autoload.host.val) {
> +               err = -ENOMEM;
> +               goto free_msr_autoload_guest;
> +       }
> +
>         vmx->vpid = allocate_vpid();
>
>         err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
> @@ -11576,7 +11632,11 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
>         kvm_vcpu_uninit(&vmx->vcpu);
>  free_vcpu:
>         free_vpid(vmx->vpid);
> -       kmem_cache_free(kvm_vcpu_cache, vmx);
> +       kmem_cache_free(vmx_msr_entry_cache, vmx->msr_autoload.host.val);
> +free_msr_autoload_guest:
> +       kmem_cache_free(vmx_msr_entry_cache, vmx->msr_autoload.guest.val);
> +free_vmx:
> +       vfree(vmx);
>         return ERR_PTR(err);
>  }
>
> @@ -15153,6 +15213,10 @@ module_exit(vmx_exit);
>  static int __init vmx_init(void)
>  {
>         int r;
> +       size_t vmx_msr_entry_size =
> +               sizeof(struct vmx_msr_entry) * NR_AUTOLOAD_MSRS;
> +
> +       CHECK_INTRA_PAGE(vmx_msr_entry_size);
>
>  #if IS_ENABLED(CONFIG_HYPERV)
>         /*
> @@ -15183,10 +15247,25 @@ static int __init vmx_init(void)
>         }
>  #endif
>
> -       r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
> -                    __alignof__(struct vcpu_vmx), THIS_MODULE);
> +       /*
> +        * Disable kmem cache; vmalloc will be used instead
> +        * to avoid OOM'ing when memory is available but not contiguous.
> +        */
> +       r = kvm_init(&vmx_x86_ops, 0, 0, THIS_MODULE);
>         if (r)
>                 return r;
> +       /*
> +        * A vmx_msr_entry array resides exclusively within the kernel. Thus,
> +        * use kmem_cache_create_usercopy(), with the usersize argument set to
> +        * ZERO, to blacklist copying vmx_msr_entry to/from user space.
> +        */
> +       vmx_msr_entry_cache =
> +               kmem_cache_create_usercopy("vmx_msr_entry", vmx_msr_entry_size,
> +                                 vmx_msr_entry_size, SLAB_ACCOUNT, 0, 0, NULL);
> +       if (!vmx_msr_entry_cache) {
> +               r = -ENOMEM;
> +               goto out;
> +       }
>
>         /*
>          * Must be called after kvm_init() so enable_ept is properly set
> @@ -15210,5 +15289,8 @@ static int __init vmx_init(void)
>         vmx_check_vmcs12_offsets();
>
>         return 0;
> +out:
> +       kvm_exit();
> +       return r;
>  }
>  module_init(vmx_init);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 786ade1843a2..8b979e7c3ecd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4038,18 +4038,22 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
>                 goto out_free_2;
>         register_reboot_notifier(&kvm_reboot_notifier);
>
> -       /* A kmem cache lets us meet the alignment requirements of fx_save. */
> -       if (!vcpu_align)
> -               vcpu_align = __alignof__(struct kvm_vcpu);
> -       kvm_vcpu_cache =
> -               kmem_cache_create_usercopy("kvm_vcpu", vcpu_size, vcpu_align,
> -                                          SLAB_ACCOUNT,
> -                                          offsetof(struct kvm_vcpu, arch),
> -                                          sizeof_field(struct kvm_vcpu, arch),
> -                                          NULL);
> -       if (!kvm_vcpu_cache) {
> -               r = -ENOMEM;
> -               goto out_free_3;
> +       /*
> +        * When vcpu_size is zero,
> +        * architecture-specific code manages its own vcpu allocation.
> +        */
> +       kvm_vcpu_cache = NULL;
> +       if (vcpu_size) {
> +               if (!vcpu_align)
> +                       vcpu_align = __alignof__(struct kvm_vcpu);
> +               kvm_vcpu_cache = kmem_cache_create_usercopy(
> +                       "kvm_vcpu", vcpu_size, vcpu_align, SLAB_ACCOUNT,
> +                       offsetof(struct kvm_vcpu, arch),
> +                       sizeof_field(struct kvm_vcpu, arch), NULL);
> +               if (!kvm_vcpu_cache) {
> +                       r = -ENOMEM;
> +                       goto out_free_3;
> +               }
>         }
>
>         r = kvm_async_pf_init();
> --
> 2.19.1.568.g152ad8e336-goog
>


* Re: [kvm PATCH 1/2] mm: export __vmalloc_node_range()
  2018-10-23 21:10       ` Marc Orr
@ 2018-10-24  6:16         ` Michal Hocko
  2018-10-24  8:12           ` Marc Orr
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2018-10-24  6:16 UTC (permalink / raw)
  To: Marc Orr; +Cc: konrad.wilk, linux-mm, akpm, kvm, Jim Mattson, David Rientjes

On Tue 23-10-18 17:10:55, Marc Orr wrote:
> Ack. The user is the 2nd patch in this series, the kvm_intel module,
> which uses this version of vmalloc() to allocate vcpus across
> non-contiguous memory. I will cc everyone here on that 2nd patch for
> context.

Is there any reason to not fold those two into a single one?
-- 
Michal Hocko
SUSE Labs


* Re: [kvm PATCH 1/2] mm: export __vmalloc_node_range()
  2018-10-24  6:16         ` Michal Hocko
@ 2018-10-24  8:12           ` Marc Orr
  2018-10-24  8:22             ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Orr @ 2018-10-24  8:12 UTC (permalink / raw)
  To: mhocko
  Cc: Konrad Rzeszutek Wilk, linux-mm, akpm, kvm, Jim Mattson, David Rientjes

No. I separated them because they're going to two different subsystems
(i.e., mm and kvm). I'll fold them and resend the patch.
Thanks,
Marc
On Wed, Oct 24, 2018 at 7:16 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 23-10-18 17:10:55, Marc Orr wrote:
> > Ack. The user is the 2nd patch in this series, the kvm_intel module,
> > which uses this version of vmalloc() to allocate vcpus across
> > non-contiguous memory. I will cc everyone here on that 2nd patch for
> > context.
>
> Is there any reason to not fold those two into a single one?
> --
> Michal Hocko
> SUSE Labs


* Re: [kvm PATCH 1/2] mm: export __vmalloc_node_range()
  2018-10-24  8:12           ` Marc Orr
@ 2018-10-24  8:22             ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2018-10-24  8:22 UTC (permalink / raw)
  To: Marc Orr
  Cc: Konrad Rzeszutek Wilk, linux-mm, akpm, kvm, Jim Mattson, David Rientjes

Please do not top-post

On Wed 24-10-18 09:12:52, Marc Orr wrote:
> No. I separated them because they're going to two different subsystems
> (i.e., mm and kvm).

Yes, they do go to two different subsystems but they would have to
coordinate for the final merge because the latter wouldn't work without
the former. So it is easier to have them in a single tree. From the
review POV it is better to have them in the single patch to see the
usecase for the export and judge whether this is the best option.

> I'll fold them and resend the patch.

Thanks!

> On Wed, Oct 24, 2018 at 7:16 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 23-10-18 17:10:55, Marc Orr wrote:
> > > Ack. The user is the 2nd patch in this series, the kvm_intel module,
> > > which uses this version of vmalloc() to allocate vcpus across
> > > non-contiguous memory. I will cc everyone here on that 2nd patch for
> > > context.
> >
> > Is there any reason to not fold those two into a single one?
> > --
> > Michal Hocko
> > SUSE Labs

-- 
Michal Hocko
SUSE Labs


* Re: [kvm PATCH 2/2] kvm: vmx: use vmalloc() to allocate vcpus
  2018-10-23 21:13   ` [kvm PATCH 2/2] kvm: vmx: use vmalloc() to allocate vcpus Marc Orr
@ 2018-10-24 11:41     ` Matthew Wilcox
  2018-10-24 18:05       ` Marc Orr
  2018-10-24 22:31     ` Sean Christopherson
  1 sibling, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2018-10-24 11:41 UTC (permalink / raw)
  To: Marc Orr
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk, linux-mm, akpm

On Tue, Oct 23, 2018 at 05:13:40PM -0400, Marc Orr wrote:
> > +       struct vcpu_vmx *vmx = __vmalloc_node_range(
> > +                       sizeof(struct vcpu_vmx),
> > +                       __alignof__(struct vcpu_vmx),
> > +                       VMALLOC_START,
> > +                       VMALLOC_END,
> > +                       GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO | __GFP_ACCOUNT,
> > +                       PAGE_KERNEL,
> > +                       0,
> > +                       NUMA_NO_NODE,
> > +                       __builtin_return_address(0));

I don't understand why you need to expose the lowest-level
__vmalloc_node_range to do what you need to do.

For example, __vmalloc_node would be easier for you to use while giving you
all the flexibility you think you want.

In fact, I don't think you need all the flexibility you're using.
vmalloc is always going to give you a page-aligned address, so
__alignof__(struct foo) isn't going to do anything worthwhile.

VMALLOC_START, VMALLOC_END, PAGE_KERNEL, 0, NUMA_NO_NODE, and
__builtin_return_address(0) are all the defaults.  So all you actually
need are these GFP flags.  __GFP_HIGHMEM isn't needed; vmalloc can
always allocate from highmem unless you're doing a DMA alloc.  So
it's just __GFP_ACCOUNT that's not supported by regular vzalloc().

I see __vmalloc_node_flags_caller is already non-static, so that would
be where I went and your call becomes simply:

	vmx = __vmalloc_node_flags_caller(sizeof(struct vcpu_vmx),
				NUMA_NO_NODE,
				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT,
				__builtin_return_address(0));

I suspect a better option might be to add a vzalloc_account() call
and then your code becomes:

	vmx = vzalloc_account(sizeof(struct vcpu_vmx));

while vmalloc gains:

void *vmalloc_account(unsigned long size)
{
	return __vmalloc_node_flags(size, NUMA_NO_NODE,
				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT);
}
EXPORT_SYMBOL(vmalloc_account);


* Re: [kvm PATCH 2/2] kvm: vmx: use vmalloc() to allocate vcpus
  2018-10-24 11:41     ` Matthew Wilcox
@ 2018-10-24 18:05       ` Marc Orr
  2018-10-25 12:58         ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Orr @ 2018-10-24 18:05 UTC (permalink / raw)
  To: willy
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk, linux-mm, akpm

On Wed, Oct 24, 2018 at 12:41 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Oct 23, 2018 at 05:13:40PM -0400, Marc Orr wrote:
> > > +       struct vcpu_vmx *vmx = __vmalloc_node_range(
> > > +                       sizeof(struct vcpu_vmx),
> > > +                       __alignof__(struct vcpu_vmx),
> > > +                       VMALLOC_START,
> > > +                       VMALLOC_END,
> > > +                       GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO | __GFP_ACCOUNT,
> > > +                       PAGE_KERNEL,
> > > +                       0,
> > > +                       NUMA_NO_NODE,
> > > +                       __builtin_return_address(0));
>
> I don't understand why you need to expose the lowest-level
> __vmalloc_node_range to do what you need to do.
>
> For example, __vmalloc_node would be easier for you to use while giving you
> all the flexibility you think you want.
>
> In fact, I don't think you need all the flexibility you're using.
> vmalloc is always going to give you a page-aligned address, so
> __alignof__(struct foo) isn't going to do anything worthwhile.
>
> VMALLOC_START, VMALLOC_END, PAGE_KERNEL, 0, NUMA_NO_NODE, and
> __builtin_return_address(0) are all the defaults.  So all you actually
> need are these GFP flags.  __GFP_HIGHMEM isn't needed; vmalloc can
> always allocate from highmem unless you're doing a DMA alloc.  So
> it's just __GFP_ACCOUNT that's not supported by regular vzalloc().
>
> I see __vmalloc_node_flags_caller is already non-static, so that would
> be where I went and your call becomes simply:
>
>         vmx = __vmalloc_node_flags_caller(sizeof(struct vcpu_vmx),
>                                 NUMA_NO_NODE,
>                                 GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT,
>                                 __builtin_return_address(0));
>
> I suspect a better option might be to add a vzalloc_account() call
> and then your code becomes:
>
>         vmx = vzalloc_account(sizeof(struct vcpu_vmx));
>
> while vmalloc gains:
>
> void *vmalloc_account(unsigned long size)
> {
>         return __vmalloc_node_flags(size, NUMA_NO_NODE,
>                                 GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT);
> }
> EXPORT_SYMBOL(vmalloc_account);

I 100% agree with this review! I only need the __GFP_ACCOUNT flag.
Actually, the first version of this patch that I developed downstream
resembled what you're suggesting here. But I've never touched the mm
subsystem before, and we were not confident about how to shape the patch
for upstreaming, so that's how we ended up with this version, which
makes minimal changes to the mm subsystem.

Anyway, let me refactor the patch exactly as you've suggested in your
review, and send out a new version.


* Re: [kvm PATCH 2/2] kvm: vmx: use vmalloc() to allocate vcpus
  2018-10-23 21:13   ` [kvm PATCH 2/2] kvm: vmx: use vmalloc() to allocate vcpus Marc Orr
  2018-10-24 11:41     ` Matthew Wilcox
@ 2018-10-24 22:31     ` Sean Christopherson
  1 sibling, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2018-10-24 22:31 UTC (permalink / raw)
  To: Marc Orr
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk, linux-mm, akpm

On Tue, Oct 23, 2018 at 05:13:40PM -0400, Marc Orr wrote:
> Adding everyone that's cc'd on [kvm PATCH 1/2] mm: export
> __vmalloc_node_range().
> Thanks,
> Marc
> On Sat, Oct 20, 2018 at 5:13 PM Marc Orr <marcorr@google.com> wrote:
> >
> > Previously, vcpus were allocated through the kmem_cache_zalloc() API,
> > which requires the underlying physical memory to be contiguous.
> > Because the x86 vcpu struct, struct vcpu_vmx, is relatively large
> > (e.g., currently 47680 bytes on my setup), it can become hard to find
> > contiguous memory.
> >
> > At the same time, the comments in the code indicate that the primary
> > reason for using the kmem_cache_zalloc() API is to align the memory
> > rather than to provide physical contiguity.
> >
> > Thus, this patch updates the vcpu allocation logic for vmx to use the
> > vmalloc() API.
> >
> > Signed-off-by: Marc Orr <marcorr@google.com>
> > ---
> >  arch/x86/kvm/vmx.c  | 98 +++++++++++++++++++++++++++++++++++++++++----
> >  virt/kvm/kvm_main.c | 28 +++++++------
> >  2 files changed, 106 insertions(+), 20 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index abeeb45d1c33..d480a2cc0667 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -898,7 +898,14 @@ struct nested_vmx {
> >  #define POSTED_INTR_ON  0
> >  #define POSTED_INTR_SN  1
> >
> > -/* Posted-Interrupt Descriptor */
> > +/*
> > + * Posted-Interrupt Descriptor
> > + *
> > + * Note, the physical address of this structure is used by VMX. Furthermore, the
> > + * translation code assumes that the entire pi_desc struct resides within a
> > + * single page, which will be true because the struct is 64 bytes and 64-byte
> > + * aligned.
> > + */
> >  struct pi_desc {
> >         u32 pir[8];     /* Posted interrupt requested */
> >         union {
> > @@ -970,8 +977,25 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)
> >
> >  struct vmx_msrs {
> >         unsigned int            nr;
> > -       struct vmx_msr_entry    val[NR_AUTOLOAD_MSRS];
> > +       struct vmx_msr_entry    *val;
> >  };
> > +struct kmem_cache *vmx_msr_entry_cache;

The vmx_msr_entry changes should be done as a separate prereq patch,
e.g. for bisecting and/or reverting in case there is a bug.  AFAICT they
don't depend on moving to vmalloc.


* Re: [kvm PATCH 2/2] kvm: vmx: use vmalloc() to allocate vcpus
  2018-10-24 18:05       ` Marc Orr
@ 2018-10-25 12:58         ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2018-10-25 12:58 UTC (permalink / raw)
  To: Marc Orr
  Cc: willy, kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk,
	linux-mm, akpm

On Wed 24-10-18 19:05:19, Marc Orr wrote:
> On Wed, Oct 24, 2018 at 12:41 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Oct 23, 2018 at 05:13:40PM -0400, Marc Orr wrote:
> > > > +       struct vcpu_vmx *vmx = __vmalloc_node_range(
> > > > +                       sizeof(struct vcpu_vmx),
> > > > +                       __alignof__(struct vcpu_vmx),
> > > > +                       VMALLOC_START,
> > > > +                       VMALLOC_END,
> > > > +                       GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO | __GFP_ACCOUNT,
> > > > +                       PAGE_KERNEL,
> > > > +                       0,
> > > > +                       NUMA_NO_NODE,
> > > > +                       __builtin_return_address(0));
> >
> > I don't understand why you need to expose the lowest-level
> > __vmalloc_node_range to do what you need to do.
> >
> > For example, __vmalloc_node would be easier for you to use while giving you
> > all the flexibility you think you want.
> >
> > In fact, I don't think you need all the flexibility you're using.
> > vmalloc is always going to give you a page-aligned address, so
> > __alignof__(struct foo) isn't going to do anything worthwhile.
> >
> > VMALLOC_START, VMALLOC_END, PAGE_KERNEL, 0, NUMA_NO_NODE, and
> > __builtin_return_address(0) are all the defaults.  So all you actually
> > need are these GFP flags.  __GFP_HIGHMEM isn't needed; vmalloc can
> > always allocate from highmem unless you're doing a DMA alloc.  So
> > it's just __GFP_ACCOUNT that's not supported by regular vzalloc().
> >
> > I see __vmalloc_node_flags_caller is already non-static, so that would
> > be where I went and your call becomes simply:
> >
> >         vmx = __vmalloc_node_flags_caller(sizeof(struct vcpu_vmx),
> >                                 NUMA_NO_NODE,
> >                                 GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT,
> >                                 __builtin_return_address(0));
> >
> > I suspect a better option might be to add a vzalloc_account() call
> > and then your code becomes:
> >
> >         vmx = vzalloc_account(sizeof(struct vcpu_vmx));
> >
> > while vmalloc gains:
> >
> > void *vmalloc_account(unsigned long size)
> > {
> >         return __vmalloc_node_flags(size, NUMA_NO_NODE,
> >                                 GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT);
> > }
> > EXPORT_SYMBOL(vmalloc_account);
> 
> I 100% agree with this review! I only need the __GFP_ACCOUNT flag.

__vmalloc is already exported. Can you use that instead?

> Actually, the first version of this patch that I developed downstream,
> resembled what you're suggesting here. But I've never touched the mm
> subsystem before, and we were not confident on how to shape the patch
> for upstreaming, so that's how we ended up with this version, which
> makes minimal changes to the mm subsystem.

And now you can see why there was pushback to add the user of the
exported API in a single patch. It is much easier to review that way.
-- 
Michal Hocko
SUSE Labs

