Message-ID: <5285C899.6040005@linux.vnet.ibm.com>
Date: Fri, 15 Nov 2013 15:09:13 +0800
From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
To: Marcelo Tosatti
CC: gleb@redhat.com, avi.kivity@gmail.com, pbonzini@redhat.com, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
References: <1382534973-13197-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com> <1382534973-13197-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com> <20131114003609.GA15692@amt.cnet> <52845C6C.6060708@linux.vnet.ibm.com> <20131114183959.GA27064@amt.cnet>
In-Reply-To: <20131114183959.GA27064@amt.cnet>

On 11/15/2013 02:39 AM, Marcelo Tosatti wrote:
> On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
>>
>> Hi Marcelo,
>>
>> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
>>
>>>
>>> Any code location which reads the writable bit in the spte and assumes, if it is not
>>> set, that the translation which the spte refers to is not cached in a
>>> remote CPU's TLB can become buggy. (*)
>>>
>>> It might be the case that now it's not an issue, but it's so subtle that
>>> it should be improved.
>>>
>>> Can you add a fat comment on top of is_writeable_bit describing this?
>>> (and explain why is_writable_pte users do not make an assumption
>>> about (*)).
>>>
>>> "Writeable bit of locklessly modifiable sptes might be cleared
>>> but TLBs not flushed: so whenever reading locklessly modifiable sptes
>>> you cannot assume TLBs are flushed".
>>>
>>> For example this one is unclear:
>>>
>>>         if (!can_unsync && is_writable_pte(*sptep))
>>>                 goto set_pte;
>>>
>>> And:
>>>
>>>         if (!is_writable_pte(spte) &&
>>>             !(pt_protect && spte_is_locklessly_modifiable(spte)))
>>>                 return false;
>>>
>>> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
>>> serialized by a single mutex (if there were two mutexes, it would not be
>>> safe). Can you add an assert to both
>>> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log
>>> that (slots_lock) is locked, and explain?
>>>
>>> So just improve the comments please, thanks (no need to resend whole
>>> series).
>>
>> Thank you very much for your time to review it, and I really appreciate
>> that you detailed the issue so clearly to me.
>>
>> I will do it on top of this patchset or after it is merged
>> (if that's possible).
>
> Ok, can you explain why every individual caller of is_writable_pte has
> no such assumption now? (the one mentioned above is not clear to me for
> example, should explain all of them).

Okay. Generally speaking: 1) we need not worry much about readonly sptes,
since they can never be locklessly write-protected, and 2) wherever
is_writable_pte() is used to check the mmu's intended state, we can check
SPTE_MMU_WRITEABLE instead.
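For reference, the distinction between the two bits comes down to this
helper (sketched from memory here; the exact names and bits should be
checked against arch/x86/kvm/mmu.c):

static bool spte_is_locklessly_modifiable(u64 spte)
{
	/*
	 * Only sptes that carry both software-writable bits can be
	 * write-protected out of mmu-lock; a readonly spte never has
	 * both set, so the lockless path never touches it.
	 */
	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
		(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
}

That is, SPTE_MMU_WRITEABLE reflects what the mmu intends for the spte,
while the hardware writable bit may already have been cleared locklessly
without a TLB flush.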
These are the places where is_writable_pte() is used:

1) in spte_has_volatile_bits():

 527 static bool spte_has_volatile_bits(u64 spte)
 528 {
 529         /*
 530          * Always atomicly update spte if it can be updated
 531          * out of mmu-lock, it can ensure dirty bit is not lost,
 532          * also, it can help us to get a stable is_writable_pte()
 533          * to ensure tlb flush is not missed.
 534          */
 535         if (spte_is_locklessly_modifiable(spte))
 536                 return true;
 537
 538         if (!shadow_accessed_mask)
 539                 return false;
 540
 541         if (!is_shadow_present_pte(spte))
 542                 return false;
 543
 544         if ((spte & shadow_accessed_mask) &&
 545               (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
 546                 return false;
 547
 548         return true;
 549 }

This path is not broken: any spte that can be locklessly modified always
takes the atomic update path (we return 'true' at line 536 for it).

2) in mmu_spte_update():

 594         /*
 595          * For the spte updated out of mmu-lock is safe, since
 596          * we always atomicly update it, see the comments in
 597          * spte_has_volatile_bits().
 598          */
 599         if (spte_is_locklessly_modifiable(old_spte) &&
 600               !is_writable_pte(new_spte))
 601                 ret = true;

Here new_spte is a temporary value that cannot yet be seen by the lockless
write-protection path, and !is_writable_pte(new_spte) is stable enough:
a readonly spte cannot be locklessly write-protected.

3) in spte_write_protect():

1368         if (!is_writable_pte(spte) &&
1369               !spte_is_locklessly_modifiable(spte))
1370                 return false;
1371

Write-protection is always done if the spte is locklessly modifiable.
(This is how the code looks after the whole patchset is applied; it is
also safe before patch "[PATCH v3 14/15] KVM: MMU: clean up
spte_write_protect", since there the lockless write-protection path is
serialized by a single lock.)

4) in set_spte():

2690         /*
2691          * Optimization: for pte sync, if spte was writable the hash
2692          * lookup is unnecessary (and expensive). Write protection
2693          * is responsibility of mmu_get_page / kvm_sync_page.
2694          * Same reasoning can be applied to dirty page accounting.
2695          */
2696         if (!can_unsync && is_writable_pte(*sptep))
2697                 goto set_pte;

This is only an optimization; the worst case is that the optimization is
lost (we walk the shadow pages in the hash table) when the spte has been
locklessly write-protected. That does not hurt correctness, it is a rare
event, and the optimization can be restored by checking SPTE_MMU_WRITEABLE
instead.

5) in fast_page_fault():

3110         /*
3111          * Check if it is a spurious fault caused by TLB lazily flushed.
3112          *
3113          * Need not check the access of upper level table entries since
3114          * they are always ACC_ALL.
3115          */
3116         if (is_writable_pte(spte)) {
3117                 ret = true;
3118                 goto exit;
3119         }

Since kvm_vm_ioctl_get_dirty_log() gets and clears the dirty bitmap before
doing write-protection, the dirty bitmap is properly set again when
fast_page_fault() fixes a spte that was write-protected by the lockless
write-protection.

6) in fast_page_fault's tracepoint:

 244 #define __spte_satisfied(__spte)                                \
 245         (__entry->retry && is_writable_pte(__entry->__spte))

This can make the tracepoint report a wrong result when fast_page_fault
and tdp_page_fault/lockless write-protection run concurrently, but I think
that is acceptable since it is only used for tracing.

7) in audit_write_protection():

 202                         if (is_writable_pte(*sptep))
 203                                 audit_printk(kvm, "shadow page has writable "
 204                                              "mappings: gfn %llx role %x\n",
 205                                              sp->gfn, sp->role.word);

This is fine since the lockless write-protection never touches readonly
sptes.

>
> OK to improve comments later.

Thank you, Marcelo!
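P.S. For the follow-up, I am thinking of something along these lines
(only a rough sketch; the exact wording and placement still need to be
checked against the tree):

/*
 * The writable bit of a locklessly modifiable spte can be cleared while
 * the TLBs are not yet flushed: when reading such a spte you cannot
 * assume that the translation is no longer cached in a remote CPU's TLB.
 * Callers that care about the mmu's intention should check
 * SPTE_MMU_WRITEABLE instead.
 */
static int is_writable_pte(unsigned long pte)
{
	return pte & PT_WRITABLE_MASK;
}

plus a lockdep_assert_held(&kvm->slots_lock) in
kvm_mmu_slot_remove_write_access() (and after slots_lock is taken in
kvm_vm_ioctl_get_dirty_log()) to document that the two write-protection
paths are serialized by that mutex.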