From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20180228195320.165230-1-borntraeger@de.ibm.com> <79f7059b-f2d3-a758-6bb9-29433b31b313@redhat.com> <20180301092442.GA2994@work-vm> <20180301114543.GC2994@work-vm> <69654fb2-f5ba-c23b-f6f5-1b559692cf37@de.ibm.com> <20180301122854.GD2994@work-vm>
From: Christian Borntraeger
Date: Thu, 1 Mar 2018 13:35:08 +0100
MIME-Version: 1.0
In-Reply-To: <20180301122854.GD2994@work-vm>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Message-Id: <416dbf71-c311-1f40-baa6-414cdf220f4d@de.ibm.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] s390/kvm: implement clearing part of IPL clear
To: "Dr. David Alan Gilbert"
Cc: Thomas Huth, qemu-s390x, qemu-devel, Cornelia Huck, David Hildenbrand, Halil Pasic, Janosch Frank, Paolo Bonzini

On 03/01/2018 01:28 PM, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>
>>
>> On 03/01/2018 12:45 PM, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>>
>>>>
>>>> On 03/01/2018 10:24 AM, Dr. David Alan Gilbert wrote:
>>>>> * Thomas Huth (thuth@redhat.com) wrote:
>>>>>> On 28.02.2018 20:53, Christian Borntraeger wrote:
>>>>>>> When a guest reboots with diagnose 308 subcode 3 it requests the memory
>>>>>>> to be cleared. We did not do it so far. This does not only violate the
>>>>>>> architecture, it also misses the chance to free up that memory on
>>>>>>> reboot, which would help on host memory overcommitment. By using
>>>>>>> ram_block_discard_range we can cover both cases.
>>>>>>
>>>>>> Sounds like a good idea. I wonder whether that release_all_ram()
>>>>>> function should maybe rather reside in exec.c, so that other machines
>>>>>> that want to clear all RAM at reset time can use it, too?
>>>>>>
>>>>>>> Signed-off-by: Christian Borntraeger
>>>>>>> ---
>>>>>>>  target/s390x/kvm.c | 19 +++++++++++++++++++
>>>>>>>  1 file changed, 19 insertions(+)
>>>>>>>
>>>>>>> diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
>>>>>>> index 8f3a422288..2e145ad5c3 100644
>>>>>>> --- a/target/s390x/kvm.c
>>>>>>> +++ b/target/s390x/kvm.c
>>>>>>> @@ -34,6 +34,8 @@
>>>>>>>  #include "qapi/error.h"
>>>>>>>  #include "qemu/error-report.h"
>>>>>>>  #include "qemu/timer.h"
>>>>>>> +#include "qemu/rcu_queue.h"
>>>>>>> +#include "sysemu/cpus.h"
>>>>>>>  #include "sysemu/sysemu.h"
>>>>>>>  #include "sysemu/hw_accel.h"
>>>>>>>  #include "hw/boards.h"
>>>>>>> @@ -41,6 +43,7 @@
>>>>>>>  #include "sysemu/device_tree.h"
>>>>>>>  #include "exec/gdbstub.h"
>>>>>>>  #include "exec/address-spaces.h"
>>>>>>> +#include "exec/ram_addr.h"
>>>>>>>  #include "trace.h"
>>>>>>>  #include "qapi-event.h"
>>>>>>>  #include "hw/s390x/s390-pci-inst.h"
>>>>>>> @@ -1841,6 +1844,14 @@ static int kvm_arch_handle_debug_exit(S390CPU *cpu)
>>>>>>>      return ret;
>>>>>>>  }
>>>>>>>
>>>>>>> +static void release_all_rams(void)
>>>>>>
>>>>>> s/rams/ram/ maybe?
>>>>>>
>>>>>>> +{
>>>>>>> +    struct RAMBlock *rb;
>>>>>>> +
>>>>>>> +    QLIST_FOREACH_RCU(rb, &ram_list.blocks, next)
>>>>>>> +        ram_block_discard_range(rb, 0, rb->used_length);
>>>>>>
>>>>>> From a coding style point of view, I think there should be curly braces
>>>>>> around ram_block_discard_range() ?
>>>>>
>>>>> I think this might break if it happens during a postcopy migrate.
>>>>> The destination CPU is running, so it can do a reboot at just the wrong
>>>>> time; and then the pages (that are protected by userfaultfd) would get
>>>>> deallocated and trigger userfaultfd requests if accessed.
>>>>
>>>> Yes, userfaultfd/postcopy is really fragile and relies on things that are not
>>>> necessarily true (e.g. virtio-balloon can also invalidate pages).
>>>
>>> That's why we use qemu_balloon_inhibit around postcopy to stop
>>> ballooning; I'm not aware of anything else that does the same.
>>
>> We also have at least the pte_unused thing in mm/rmap.c that clearly
>> predates userfaultfd. We might need to look into this as well....
>
> I've not come across that; what does that do?

It can drop a page on page-out if the page is no longer of value. It is used
by the CMMA (guest page hinting) code of s390x.

See kernel mm/rmap.c:

static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                             unsigned long address, void *arg)
{
[...]
        } else if (pte_unused(pteval)) {
                /*
                 * The guest indicated that the page content is of no
                 * interest anymore. Simply discard the pte, vmscan
                 * will take care of the rest.
                 */
                dec_mm_counter(mm, mm_counter(page));
                /* We have to invalidate as we cleared the pte */
                mmu_notifier_invalidate_range(mm, address,
                                              address + PAGE_SIZE);
        } else if (IS_ENABLED(CONFIG_MIGRATION) &&
                        (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
[...]

>
>>>
>>>> The right thing here would be to actually terminate the postcopy migrate but
>>>> return it as "successful" (since we are going to clear that RAM anyway). Do
>>>> you see a good way to achieve that?
>>>
>>> There's no current mechanism to do it; I think it would have to involve
>>> some interaction with the source as well though to tell it that you
>>> didn't need that area of RAM anyway.
>>>
>>> However, there are more problems:
>>> a) Even forgetting the userfault problem, this is racy since during
>>> postcopy you're still receiving blocks from the source at the same time;
>>> so some of the area that you've discarded might get overwritten by data
>>> from the source.
>>
>> So how do you handle the case when the target system writes to memory
>> that is still in flight? Can we build on that mechanism?
>
> Once we've entered postcopy, a page is basically in one of two states:
> a) Not yet received - i.e. marked absent with MADV_DONTNEED; if the
> guest tries to write to it then it'll block with userfault and ask the
> source for the page; so the write won't happen until the page arrives.
> b) Received - we've already got the page from the source; the source
> never resends a page (once in postcopy) so now the destination can just
> write to the page.
>
> Once in postcopy, a page is received at most once (i.e. if it's not
> been received during precopy).
>
> I can imagine two ways of curing it:
> a) Simple but slow; just read all the pages before doing the
> discard, this forces it to wait for the pages to be received.
> b) More complex but fast; Add a message on the return path to the
> source telling it that you're going to discard a range; the source then
> marks its notes as cleared for those pages and then sends some form of
> ack, and at that point you drop it.

This looks like the most promising approach, but it will take some work.

>
> A 3rd; incomplete way; would be just to drop the userfaultfd on the
> destination for the RAMBlocks that are being cleared; but this does
> leave the source state in a bit of a mess.
>
>
>>> b) Your release_all_rams seems to do all RAM Blocks - won't that nuke
>>> any ROMs as well? Or maybe even flash?
>>
>> ROMs loaded with load_elf (like our s390-ccw.img) are reloaded on every reset.
>> See rom_reset in /hw/core/loader.c
>
> Ah, so this is happening after your reset code you've added?

Yes, I am stopping all CPUs and clearing the memory. Then I call system_reset.
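
To illustrate the clearing step outside of QEMU, here is a minimal standalone sketch (not the patch itself). It assumes Linux and private anonymous mappings only, where madvise(MADV_DONTNEED) drops the pages and later reads see zero-filled memory. The toy_block struct and the toy_* functions are hypothetical stand-ins, not QEMU's RAMBlock or ram_block_discard_range(), and there is no postcopy or userfaultfd handling here at all.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical stand-in for a RAM block: one private anonymous mapping. */
struct toy_block {
    unsigned char *host;   /* start of the mapping (page aligned by mmap) */
    size_t used_length;    /* number of bytes the "guest" may use */
};

/* Map an anonymous private region. Returns 0 on success, -1 on failure. */
static int toy_block_init(struct toy_block *b, size_t len)
{
    void *p;

    assert(b != NULL && len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    b->host = p;
    b->used_length = len;
    return 0;
}

/* Fill a block with a byte pattern, standing in for guest writes. */
static void toy_block_fill(struct toy_block *b, unsigned char pattern)
{
    memset(b->host, pattern, b->used_length);
}

/*
 * Discard the contents of every block. For private anonymous memory the
 * kernel frees the backing pages and the next access faults in zero pages,
 * so this both "clears" the memory and returns it to the host.
 * Returns 0 on success, -1 if any madvise() call fails.
 */
static int toy_release_all_ram(struct toy_block *blocks, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (madvise(blocks[i].host, blocks[i].used_length,
                    MADV_DONTNEED) != 0) {
            return -1;
        }
    }
    return 0;
}
```

A caller would stop whatever is writing to the blocks first, call toy_release_all_ram(), and only then restart, which mirrors the stop CPUs / clear / reset order described above. File-backed or shared memory behaves differently under MADV_DONTNEED, so this sketch does not cover those cases.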