On Tue, May 17, 2016 at 11:02:48AM +0530, Bharata B Rao wrote:
> On Mon, May 16, 2016 at 11:55 AM, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > On 05/13/2016 06:41 PM, Bharata B Rao wrote:
> >>
> >> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <aik@ozlabs.ru>
> >> wrote:
> >
> >
> >>
> >>> +
> >>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS -
> >>> spapr_phb_get_active_win_num(sphb);
> >>> +
> >>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
> >>> +    rtas_st(rets, 1, avail);
> >>> +    rtas_st(rets, 2, max_window_size);
> >>> +    rtas_st(rets, 3, pgmask);
> >>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
> >>> +
> >>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size,
> >>> pgmask);
> >>> +    return;
> >>> +
> >>> +param_error_exit:
> >>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
> >>> +}
> >>> +
> >>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
> >>> +                                          sPAPRMachineState *spapr,
> >>> +                                          uint32_t token, uint32_t
> >>> nargs,
> >>> +                                          target_ulong args,
> >>> +                                          uint32_t nret, target_ulong
> >>> rets)
> >>> +{
> >>> +    sPAPRPHBState *sphb;
> >>> +    sPAPRTCETable *tcet = NULL;
> >>> +    uint32_t addr, page_shift, window_shift, liobn;
> >>> +    uint64_t buid;
> >>> +
> >>> +    if ((nargs != 5) || (nret != 4)) {
> >>> +        goto param_error_exit;
> >>> +    }
> >>> +
> >>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
> >>> +    addr = rtas_ld(args, 0);
> >>> +    sphb = spapr_pci_find_phb(spapr, buid);
> >>> +    if (!sphb || !sphb->ddw_enabled) {
> >>> +        goto param_error_exit;
> >>> +    }
> >>> +
> >>> +    page_shift = rtas_ld(args, 3);
> >>> +    window_shift = rtas_ld(args, 4);
> >>
> >>
> >> Kernel has a bug due to which wrong window_shift gets returned here. I
> >> have posted possible fix here:
> >> https://patchwork.ozlabs.org/patch/621497/
> >>
> >> I have tried to work around this issue in QEMU too
> >> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
> >>
> >> But the above work around involves changing the memory representation
> >> in DT.
> >
> >
> > What is wrong with this workaround?
> 
> The above workaround will result in different representations for
> memory in DT before and after the workaround.
> 
> Currently for -m 2G, -numa node,nodeid=0,mem=1G -numa
> node,nodeid=1,mem=0.5G, we will have the following nodes in DT:
> 
> memory@0
> memory@40000000
> ibm,dynamic-reconfiguration-memory
> 
> ibm,dynamic-memory will have only DR LMBs:
> 
> [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
> 0000000 0000 000a 0000 0000 8000 0000 8000 0008
> 0000010 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000020 9000 0000 8000 0009 0000 0000 ffff ffff
> 0000030 0000 0000 0000 0000 a000 0000 8000 000a
> 0000040 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000050 b000 0000 8000 000b 0000 0000 ffff ffff
> 0000060 0000 0000 0000 0000 c000 0000 8000 000c
> 0000070 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000080 d000 0000 8000 000d 0000 0000 ffff ffff
> 0000090 0000 0000 0000 0000 e000 0000 8000 000e
> 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
> 00000b0 f000 0000 8000 000f 0000 0000 ffff ffff
> 00000c0 0000 0000 0000 0001 0000 0000 8000 0010
> 00000d0 0000 0000 ffff ffff 0000 0000 0000 0001
> 00000e0 1000 0000 8000 0011 0000 0000 ffff ffff
> 00000f0 0000 0000
> 
> The memory region looks like this:
> 
> memory-region: system
>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>     0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
>     0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory
> 
> After this workaround, all this will change like below:
> 
> memory@0
> ibm,dynamic-reconfiguration-memory
> 
> All LMBs in ibm,dynamic-memory:
> 
> [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
> 
> 0000000 0000 0010 0000 0000 0000 0000 8000 0000
> 0000010 0000 0000 0000 0000 0000 0080 0000 0000
> 0000020 1000 0000 8000 0001 0000 0000 0000 0000
> 0000030 0000 0080 0000 0000 2000 0000 8000 0002
> 0000040 0000 0000 0000 0000 0000 0080 0000 0000
> 0000050 3000 0000 8000 0003 0000 0000 0000 0000
> 0000060 0000 0080 0000 0000 4000 0000 8000 0004
> 0000070 0000 0000 0000 0001 0000 0008 0000 0000
> 0000080 5000 0000 8000 0005 0000 0000 0000 0001
> 0000090 0000 0008 0000 0000 6000 0000 8000 0006
> 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
> 00000b0 7000 0000 8000 0007 0000 0000 ffff ffff
> 00000c0 0000 0000 0000 0000 8000 0000 8000 0008
> 00000d0 0000 0000 ffff ffff 0000 0000 0000 0000
> 00000e0 9000 0000 8000 0009 0000 0000 ffff ffff
> 00000f0 0000 0000 0000 0000 a000 0000 8000 000a
> 0000100 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000110 b000 0000 8000 000b 0000 0000 ffff ffff
> 0000120 0000 0000 0000 0000 c000 0000 8000 000c
> 0000130 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000140 d000 0000 8000 000d 0000 0000 ffff ffff
> 0000150 0000 0000 0000 0000 e000 0000 8000 000e
> 0000160 0000 0000 ffff ffff 0000 0000 0000 0000
> 0000170 f000 0000 8000 000f 0000 0000 ffff ffff
> 0000180 0000 0000
> 
> Hotplug memory region gets a new address range now:
> 
> memory-region: system
>   0000000000000000-ffffffffffffffff (prio 0, RW): system
>     0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
>     0000000060000000-00000000ffffffff (prio 0, RW): hotplug-memory
> 
> 
> So when a guest that was booted with older QEMU is migrated to a newer
> QEMU that has this workaround, then it will start exhibiting the above
> changes after first reboot post migration.

Ok.. why is that bad?

> If user has done memory hotplug by explicitly specifying address in
> the source, then even migration would fail because the addr specified
> at the target will not be part of hotplug-memory range.

Sorry, not really following the situation you're describing here.

> Hence I believe we shoudn't workaround in this manner but have the
> workaround in the DDW code where the window can be easily fixed.
> 
> >
> >> Hence I feel until the guest kernel changes are available, a
> >> simpler work around would be to discard the window_shift value above
> >> and recalculate the right value as below:
> >>
> >> if (machine->ram_size == machine->maxram_size) {
> >>     max_window_size = machine->ram_size;
> >> } else {
> >>      MemoryHotplugState *hpms = &spapr->hotplug_memory;
> >>      max_window_size = hpms->base + memory_region_size(&hpms->mr);
> >> }
> >> window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT;
> >>
> >> and create DDW based on this calculated window_shift value. Does that
> >> sound reasonable ?
> >
> >
> > The workaround should only do that for the second window, at least, or for
> > the default one but with page size at least 64K; otherwise it is going to be
> > a waste of memory (2MB per each 1GB of guest RAM).
> 
> Ok, will sync up with you separately to understand more about the
> 'two' windows here.
> 
> Regards,
> Bharata.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson