On Tue, May 17, 2016 at 11:02:48AM +0530, Bharata B Rao wrote: > On Mon, May 16, 2016 at 11:55 AM, Alexey Kardashevskiy wrote: > > On 05/13/2016 06:41 PM, Bharata B Rao wrote: > >> > >> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy > >> wrote: > > > > > >> > >>> + > >>> + avail = SPAPR_PCI_DMA_MAX_WINDOWS - > >>> spapr_phb_get_active_win_num(sphb); > >>> + > >>> + rtas_st(rets, 0, RTAS_OUT_SUCCESS); > >>> + rtas_st(rets, 1, avail); > >>> + rtas_st(rets, 2, max_window_size); > >>> + rtas_st(rets, 3, pgmask); > >>> + rtas_st(rets, 4, 0); /* DMA migration mask, not supported */ > >>> + > >>> + trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, > >>> pgmask); > >>> + return; > >>> + > >>> +param_error_exit: > >>> + rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR); > >>> +} > >>> + > >>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu, > >>> + sPAPRMachineState *spapr, > >>> + uint32_t token, uint32_t > >>> nargs, > >>> + target_ulong args, > >>> + uint32_t nret, target_ulong > >>> rets) > >>> +{ > >>> + sPAPRPHBState *sphb; > >>> + sPAPRTCETable *tcet = NULL; > >>> + uint32_t addr, page_shift, window_shift, liobn; > >>> + uint64_t buid; > >>> + > >>> + if ((nargs != 5) || (nret != 4)) { > >>> + goto param_error_exit; > >>> + } > >>> + > >>> + buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2); > >>> + addr = rtas_ld(args, 0); > >>> + sphb = spapr_pci_find_phb(spapr, buid); > >>> + if (!sphb || !sphb->ddw_enabled) { > >>> + goto param_error_exit; > >>> + } > >>> + > >>> + page_shift = rtas_ld(args, 3); > >>> + window_shift = rtas_ld(args, 4); > >> > >> > >> Kernel has a bug due to which wrong window_shift gets returned here. I > >> have posted possible fix here: > >> https://patchwork.ozlabs.org/patch/621497/ > >> > >> I have tried to work around this issue in QEMU too > >> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html > >> > >> But the above work around involves changing the memory representation > >> in DT. > > > > > > What is wrong with this workaround? > > The above workaround will result in different representations for > memory in DT before and after the workaround. > > Currently for -m 2G, -numa node,nodeid=0,mem=1G -numa > node,nodeid=1,mem=0.5G, we will have the following nodes in DT: > > memory@0 > memory@40000000 > ibm,dynamic-reconfiguration-memory > > ibm,dynamic-memory will have only DR LMBs: > > [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory > 0000000 0000 000a 0000 0000 8000 0000 8000 0008 > 0000010 0000 0000 ffff ffff 0000 0000 0000 0000 > 0000020 9000 0000 8000 0009 0000 0000 ffff ffff > 0000030 0000 0000 0000 0000 a000 0000 8000 000a > 0000040 0000 0000 ffff ffff 0000 0000 0000 0000 > 0000050 b000 0000 8000 000b 0000 0000 ffff ffff > 0000060 0000 0000 0000 0000 c000 0000 8000 000c > 0000070 0000 0000 ffff ffff 0000 0000 0000 0000 > 0000080 d000 0000 8000 000d 0000 0000 ffff ffff > 0000090 0000 0000 0000 0000 e000 0000 8000 000e > 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000 > 00000b0 f000 0000 8000 000f 0000 0000 ffff ffff > 00000c0 0000 0000 0000 0001 0000 0000 8000 0010 > 00000d0 0000 0000 ffff ffff 0000 0000 0000 0001 > 00000e0 1000 0000 8000 0011 0000 0000 ffff ffff > 00000f0 0000 0000 > > The memory region looks like this: > > memory-region: system > 0000000000000000-ffffffffffffffff (prio 0, RW): system > 0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram > 0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory > > After this workaround, all this will change like below: > > memory@0 > ibm,dynamic-reconfiguration-memory > > All LMBs in ibm,dynamic-memory: > > [root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory > > 0000000 0000 0010 0000 0000 0000 0000 8000 0000 > 0000010 0000 0000 0000 0000 0000 0080 0000 0000 > 0000020 1000 0000 8000 0001 0000 0000 0000 0000 > 0000030 0000 0080 0000 0000 2000 0000 8000 0002 > 0000040 0000 0000 0000 0000 0000 0080 0000 0000 > 0000050 3000 0000 8000 0003 0000 0000 0000 0000 > 0000060 0000 0080 0000 0000 4000 0000 8000 0004 > 0000070 0000 0000 0000 0001 0000 0008 0000 0000 > 0000080 5000 0000 8000 0005 0000 0000 0000 0001 > 0000090 0000 0008 0000 0000 6000 0000 8000 0006 > 00000a0 0000 0000 ffff ffff 0000 0000 0000 0000 > 00000b0 7000 0000 8000 0007 0000 0000 ffff ffff > 00000c0 0000 0000 0000 0000 8000 0000 8000 0008 > 00000d0 0000 0000 ffff ffff 0000 0000 0000 0000 > 00000e0 9000 0000 8000 0009 0000 0000 ffff ffff > 00000f0 0000 0000 0000 0000 a000 0000 8000 000a > 0000100 0000 0000 ffff ffff 0000 0000 0000 0000 > 0000110 b000 0000 8000 000b 0000 0000 ffff ffff > 0000120 0000 0000 0000 0000 c000 0000 8000 000c > 0000130 0000 0000 ffff ffff 0000 0000 0000 0000 > 0000140 d000 0000 8000 000d 0000 0000 ffff ffff > 0000150 0000 0000 0000 0000 e000 0000 8000 000e > 0000160 0000 0000 ffff ffff 0000 0000 0000 0000 > 0000170 f000 0000 8000 000f 0000 0000 ffff ffff > 0000180 0000 0000 > > Hotplug memory region gets a new address range now: > > memory-region: system > 0000000000000000-ffffffffffffffff (prio 0, RW): system > 0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram > 0000000060000000-00000000ffffffff (prio 0, RW): hotplug-memory > > > So when a guest that was booted with older QEMU is migrated to a newer > QEMU that has this workaround, then it will start exhibiting the above > changes after first reboot post migration. Ok.. why is that bad? > If user has done memory hotplug by explicitly specifying address in > the source, then even migration would fail because the addr specified > at the target will not be part of hotplug-memory range. Sorry, not really following the situation you're describing here. > Hence I believe we shoudn't workaround in this manner but have the > workaround in the DDW code where the window can be easily fixed. > > > > >> Hence I feel until the guest kernel changes are available, a > >> simpler work around would be to discard the window_shift value above > >> and recalculate the right value as below: > >> > >> if (machine->ram_size == machine->maxram_size) { > >> max_window_size = machine->ram_size; > >> } else { > >> MemoryHotplugState *hpms = &spapr->hotplug_memory; > >> max_window_size = hpms->base + memory_region_size(&hpms->mr); > >> } > >> window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT; > >> > >> and create DDW based on this calculated window_shift value. Does that > >> sound reasonable ? > > > > > > The workaround should only do that for the second window, at least, or for > > the default one but with page size at least 64K; otherwise it is going to be > > a waste of memory (2MB per each 1GB of guest RAM). > > Ok, will sync up with you separately to understand more about the > 'two' windows here. > > Regards, > Bharata. > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson