* [PATCH v2 0/22] KVM: optimize for MMIO handled
@ 2011-06-22 14:27 Xiao Guangrong
From: Xiao Guangrong @ 2011-06-22 14:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

[-- Attachment #1: Type: text/plain, Size: 2091 bytes --]

In this version, we fix the bugs of v1:
- fix broken read emulation that spans a page boundary (see the sketch
  after this list)
- fix getting an invalid spte pointer when walking the shadow page table
  outside of the mmu lock
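
For reference, here is a minimal, self-contained userspace sketch (not the
actual patch) of the idea behind the first fix: an emulated read that crosses
a page boundary is split into per-page chunks so each chunk is handled on its
own. read_one_page() is a hypothetical stand-in for the per-page handler.

/*
 * Sketch only: split a guest read at page boundaries.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

static void read_one_page(uint64_t gva, size_t len)
{
	printf("read %zu bytes at %#llx\n", len, (unsigned long long)gva);
}

static void emulate_read(uint64_t gva, size_t len)
{
	while (len) {
		/* bytes left in the page that contains 'gva' */
		size_t chunk = PAGE_SIZE - (gva & (PAGE_SIZE - 1));

		if (chunk > len)
			chunk = len;
		read_one_page(gva, chunk);
		gva += chunk;
		len -= chunk;
	}
}

int main(void)
{
	/* 16-byte read starting 8 bytes before a page boundary */
	emulate_read(0x1ff8, 16);
	return 0;
}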

We also introduce some rules for modifying sptes in this version, so the
spte no longer needs to be atomically cleared/set on the x86_32 host.
The performance report for the x86_32 host is in a later section.
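
To make the ordering rule concrete, here is a minimal userspace sketch. The
'split_spte' union and the two helpers are illustrative only (the real code
lives in the x86_32 paths of arch/x86/kvm/mmu.c): on set, the high word is
written before the low word (which carries the present bit); on clear, the
low word is dropped first, so a lockless reader never sees a present spte
with a stale high half.

/*
 * Sketch only: update a 64-bit spte as two 32-bit stores whose ordering
 * replaces the 64-bit atomic on a 32-bit host.
 */
#include <stdint.h>
#include <stdio.h>

union split_spte {
	struct {
		uint32_t spte_low;	/* holds the present bit */
		uint32_t spte_high;
	};
	uint64_t spte;
};

static void set_spte_nonatomic(volatile union split_spte *sptep, uint64_t new_spte)
{
	union split_spte sspte = { .spte = new_spte };

	sptep->spte_high = sspte.spte_high;
	__sync_synchronize();	/* high word visible before the present bit */
	sptep->spte_low = sspte.spte_low;
}

static void clear_spte_nonatomic(volatile union split_spte *sptep)
{
	sptep->spte_low = 0;	/* drop the present bit first */
	__sync_synchronize();
	sptep->spte_high = 0;
}

int main(void)
{
	union split_spte spte = { .spte = 0 };

	set_spte_nonatomic(&spte, 0x00000001deadb001ULL);
	printf("after set:   %#llx\n", (unsigned long long)spte.spte);
	clear_spte_nonatomic(&spte);
	printf("after clear: %#llx\n", (unsigned long long)spte.spte);
	return 0;
}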

Avi,

I have sampled the overhead of the lockless shadow page walk with the
following steps (a small sketch of the 'noinline' trick follows the list):
- mark walk_shadow_page_get_mmio_spte as 'noinline'
- run the netperf test with the client in the guest (NIC is e1000) and the
  server on the host, which generates heavy mmio access pressure
- sample it with perf; the output of 'perf report' is attached
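
For completeness, a tiny standalone illustration of the first step (standard
GCC attribute usage; the function body is only a placeholder, not the KVM
walker): keeping the function out-of-line makes it show up as its own symbol
in 'perf report' instead of being folded into its caller.

#include <stdio.h>

static __attribute__((noinline)) int walk_stub(int x)
{
	return x * 2;	/* placeholder work standing in for the real walker */
}

int main(void)
{
	printf("%d\n", walk_stub(21));
	return 0;
}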

The overhead of walk_shadow_page_get_mmio_spte is 0.09%, the overhead of
handle_ept_misconfig is 0.11%, and the overhead of handle_mmio_page_fault_common
is 0.07%.

I think it is acceptable. What is your opinion?

The structure of this patchset:
- Patch 1 ~ patch 3, fix the bugs in KVM
- Patch 4 ~ patch 7, clean up read/write emulation and cache mmio info for it
  to speed up mmio emulation
- Patch 8 ~ patch 13, clean up and prepare for mmio page fault support
- Patch 14 ~ patch 18, optimize spte operations for the x86_32 host
- Patch 19 ~ patch 22, implement mmio page fault

Performance report on the x86_32 host:

Netperf (TCP_RR, NIC=e1000):
===========================
ept is enabled:

      Before         After
1st   677.28         697.58
2nd   670.93         703.94
3rd   677.19         692.17

ept is disabled:

      Before         After
1st   648.06         725.54
2nd   644.71         729.56
3rd   650.80         724.67

Kernbench (output not redirected)
=================================
ept is enabled:

      Before         After
1st   1m29.525s      1m26.904s
2nd   1m27.101s      1m26.423s
3rd   1m25.552s      1m26.070s

ept is disabled:

      Before         After
1st   3m15.126s      3m11.754s
2nd   3m16.090s      3m10.382s
3rd   3m14.428s      3m9.056s

By the way, you can find the performance report for the x86_64 host in the
v1 posting:
https://lkml.org/lkml/2011/6/7/190

[-- Attachment #2: netperf-log --]
[-- Type: text/plain, Size: 70186 bytes --]

# Events: 97K cycles
#
# Overhead   Command                 Shared Object                                      Symbol
# ........  ........  ............................  ..........................................
#
    11.50%  qemu-kvm  [kernel.kallsyms]             [k] __lock_acquire
     3.18%  qemu-kvm  qemu-kvm                      [.] 0x2ef61         
     2.90%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     2.68%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     2.57%  qemu-kvm  [kernel.kallsyms]             [k] lock_release
     2.28%  qemu-kvm  [kernel.kallsyms]             [k] check_chain_key
     2.25%  qemu-kvm  [kernel.kallsyms]             [k] trace_hardirqs_off_caller
     2.14%  qemu-kvm  [kernel.kallsyms]             [k] hlock_class
     2.01%  qemu-kvm  [kernel.kallsyms]             [k] check_flags
     2.00%  qemu-kvm  [kernel.kallsyms]             [k] native_read_tsc
     1.82%  qemu-kvm  [kernel.kallsyms]             [k] look_up_lock_class
     1.80%  qemu-kvm  [kernel.kallsyms]             [k] mark_lock
     1.67%  qemu-kvm  [kernel.kallsyms]             [k] lock_acquire
     1.63%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     1.56%  qemu-kvm  [kernel.kallsyms]             [k] lock_is_held
     1.55%  qemu-kvm  [kernel.kallsyms]             [k] match_held_lock
     1.48%  qemu-kvm  [kvm_intel]                   [k] vmx_vcpu_run
     1.30%  qemu-kvm  [kvm]                         [k] paging64_walk_addr_generic
     1.23%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_save
     1.19%  qemu-kvm  [kernel.kallsyms]             [k] local_clock
     1.12%  qemu-kvm  [kernel.kallsyms]             [k] lock_release_holdtime.part.7
     1.08%  qemu-kvm  [kernel.kallsyms]             [k] native_write_msr_safe
     0.90%  qemu-kvm  [kernel.kallsyms]             [k] lock_time_inc
     0.89%  qemu-kvm  [kernel.kallsyms]             [k] trace_hardirqs_off
     0.88%  qemu-kvm  [kernel.kallsyms]             [k] lock_acquired
     0.85%  qemu-kvm  [kernel.kallsyms]             [k] trace_hardirqs_on_caller
     0.84%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     0.81%  qemu-kvm  [kernel.kallsyms]             [k] __cycles_2_ns
     0.79%  qemu-kvm  [kernel.kallsyms]             [k] sched_clock_cpu
     0.71%  qemu-kvm  [kvm]                         [k] kvm_arch_vcpu_ioctl_run
     0.71%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_save
     0.70%  qemu-kvm  [kernel.kallsyms]             [k] mark_held_locks
     0.67%  qemu-kvm  [kvm]                         [k] x86_decode_insn
     0.63%  qemu-kvm  [kernel.kallsyms]             [k] native_sched_clock
     0.61%  qemu-kvm  [kernel.kallsyms]             [k] lock_release_non_nested
     0.60%  qemu-kvm  [kvm]                         [k] x86_emulate_insn
     0.57%  qemu-kvm  [kernel.kallsyms]             [k] system_call
     0.56%  qemu-kvm  [kernel.kallsyms]             [k] debug_lockdep_rcu_enabled
     0.51%  qemu-kvm  [kernel.kallsyms]             [k] hpet_next_event
     0.48%  qemu-kvm  [kernel.kallsyms]             [k] try_to_wake_up
     0.45%  qemu-kvm  [kernel.kallsyms]             [k] effective_load
     0.42%  qemu-kvm  [kernel.kallsyms]             [k] might_fault
     0.38%  qemu-kvm  [kernel.kallsyms]             [k] do_raw_spin_trylock
     0.37%  qemu-kvm  [kernel.kallsyms]             [k] register_lock_class
     0.37%  qemu-kvm  [kernel.kallsyms]             [k] update_cfs_shares
     0.36%  qemu-kvm  [kernel.kallsyms]             [k] copy_user_generic_string
     0.36%  qemu-kvm  [kernel.kallsyms]             [k] __might_sleep
     0.35%  qemu-kvm  [kvm]                         [k] x86_emulate_instruction
     0.34%  qemu-kvm  libc-2.14.so                  [.] __GI_ioctl
     0.34%  qemu-kvm  [kernel.kallsyms]             [k] sched_clock
     0.32%  qemu-kvm  [kernel.kallsyms]             [k] fget_light
     0.32%  qemu-kvm  [kernel.kallsyms]             [k] native_read_msr_safe
     0.31%  qemu-kvm  [kvm_intel]                   [k] vmx_handle_exit
     0.29%  qemu-kvm  [kernel.kallsyms]             [k] do_select
     0.29%  qemu-kvm  [kvm]                         [k] gfn_to_memslot
     0.28%  qemu-kvm  [kvm_intel]                   [k] vmcs_writel
     0.28%  qemu-kvm  [kvm]                         [k] kvm_memslots
     0.28%  qemu-kvm  [kvm]                         [k] do_insn_fetch
     0.27%  qemu-kvm  [kernel.kallsyms]             [k] hash_64
     0.27%  qemu-kvm  [kvm]                         [k] kvm_check_request
     0.27%  qemu-kvm  [kvm]                         [k] init_emulate_ctxt
     0.23%  qemu-kvm  [kernel.kallsyms]             [k] restore
     0.23%  qemu-kvm  [kernel.kallsyms]             [k] paravirt_read_tsc
     0.23%  qemu-kvm  [kernel.kallsyms]             [k] update_curr
     0.23%  qemu-kvm  [kernel.kallsyms]             [k] select_task_rq_fair
     0.22%  qemu-kvm  [kernel.kallsyms]             [k] do_vfs_ioctl
     0.21%  qemu-kvm  [kvm_intel]                   [k] vmx_save_host_state
     0.21%  qemu-kvm  [kvm]                         [k] emulator_read_write_onepage
     0.21%  qemu-kvm  [kvm]                         [k] __linearize
     0.20%  qemu-kvm  [kernel.kallsyms]             [k] fput
     0.20%  qemu-kvm  [kvm]                         [k] find_highest_vector
     0.20%  qemu-kvm  [kernel.kallsyms]             [k] audit_syscall_entry
     0.19%  qemu-kvm  [kvm]                         [k] decode_modrm
     0.19%  qemu-kvm  [kernel.kallsyms]             [k] schedule
     0.19%  qemu-kvm  [kernel.kallsyms]             [k] check_bytes_and_report
     0.19%  qemu-kvm  [kvm_intel]                   [k] vmx_get_msr
     0.17%  qemu-kvm  [kernel.kallsyms]             [k] audit_syscall_exit
     0.17%  qemu-kvm  libc-2.14.so                  [.] __memmove_ssse3_back
     0.17%  qemu-kvm  [kvm_intel]                   [k] vmx_vcpu_load
     0.17%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     0.17%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
     0.16%  qemu-kvm  [kernel.kallsyms]             [k] fcheck_files
     0.16%  qemu-kvm  libpthread-2.14.so            [.] pthread_mutex_lock
     0.16%  qemu-kvm  [kernel.kallsyms]             [k] unroll_tree_refs
     0.15%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_unlock_irqrestore
     0.15%  qemu-kvm  [kernel.kallsyms]             [k] __mutex_lock_common
     0.15%  qemu-kvm  [kernel.kallsyms]             [k] cpuacct_charge
     0.15%  qemu-kvm  [kernel.kallsyms]             [k] do_raw_spin_unlock
     0.15%  qemu-kvm  [kernel.kallsyms]             [k] dequeue_task_fair
     0.15%  qemu-kvm  [kvm]                         [k] kvm_apic_accept_pic_intr
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] lockdep_sys_exit_thunk
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] __list_del_entry
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] trace_hardirqs_on
     0.14%  qemu-kvm  libpthread-2.14.so            [.] __pthread_mutex_unlock
     0.14%  qemu-kvm  [kvm]                         [k] kvm_vcpu_ioctl
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] audit_get_context
     0.14%  qemu-kvm  [kvm]                         [k] kvm_io_bus_read
     0.14%  qemu-kvm  [kvm]                         [k] kvm_read_guest_virt_helper
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] native_load_tr_desc
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] ktime_get
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] update_cfs_load
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] __srcu_read_lock
     0.14%  qemu-kvm  [kernel.kallsyms]             [k] __perf_event_task_sched_out
     0.13%  qemu-kvm  [kvm]                         [k] kvm_register_read
     0.13%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_lock_irq
     0.13%  qemu-kvm  [kernel.kallsyms]             [k] timerqueue_add
     0.13%  qemu-kvm  [kernel.kallsyms]             [k] atomic64_add_unless.constprop.1
     0.13%  qemu-kvm  [kvm]                         [k] paging64_gpte_access
     0.13%  qemu-kvm  [kernel.kallsyms]             [k] trace_hardirqs_off_thunk
     0.13%  qemu-kvm  [kernel.kallsyms]             [k] _cond_resched
     0.12%  qemu-kvm  [kernel.kallsyms]             [k] tcp_poll
     0.12%  qemu-kvm  [kernel.kallsyms]             [k] need_resched
     0.12%  qemu-kvm  [kvm]                         [k] segmented_read
     0.12%  qemu-kvm  [kernel.kallsyms]             [k] audit_free_names
     0.12%  qemu-kvm  [kernel.kallsyms]             [k] trace_hardirqs_on_thunk
     0.12%  qemu-kvm  [kvm]                         [k] paging64_gva_to_gpa
     0.12%  qemu-kvm  [kvm]                         [k] apic_update_ppr
     0.12%  qemu-kvm  [kernel.kallsyms]             [k] generic_exec_single
     0.12%  qemu-kvm  [kernel.kallsyms]             [k] should_resched
     0.11%  qemu-kvm  [kvm_intel]                   [k] handle_ept_misconfig
     0.11%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.11%  qemu-kvm  [kvm]                         [k] gfn_to_hva
     0.11%  qemu-kvm  [kernel.kallsyms]             [k] path_put
     0.11%  qemu-kvm  [kernel.kallsyms]             [k] account_entity_dequeue
     0.11%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     0.11%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_lock
     0.11%  qemu-kvm  [kvm]                         [k] kvm_set_shared_msr
     0.11%  qemu-kvm  [kvm]                         [k] emulator_read_write
     0.11%  qemu-kvm  [kernel.kallsyms]             [k] nf_iterate
     0.11%  qemu-kvm  [kvm]                         [k] emulator_check_intercept
     0.10%  qemu-kvm  [kvm]                         [k] gfn_to_hva_many
     0.10%  qemu-kvm  [kvm_intel]                   [k] vmx_segment_cache_test_set
     0.10%  qemu-kvm  [kernel.kallsyms]             [k] calc_delta_mine
     0.10%  qemu-kvm  [kvm_intel]                   [k] __vmx_load_host_state.part.2
     0.10%  qemu-kvm  [kernel.kallsyms]             [k] kfree
     0.10%  qemu-kvm  [kernel.kallsyms]             [k] __hrtimer_start_range_ns
     0.10%  qemu-kvm  [kvm]                         [k] kvm_set_irq
     0.10%  qemu-kvm  [kernel.kallsyms]             [k] sys_ioctl
     0.10%  qemu-kvm  [kvm]                         [k] kvm_cpu_has_interrupt
     0.10%  qemu-kvm  [kernel.kallsyms]             [k] lockdep_sys_exit
     0.10%  qemu-kvm  [nf_conntrack]                [k] tcp_packet
     0.10%  qemu-kvm  [kvm]                         [k] kvm_arch_vcpu_load
     0.10%  qemu-kvm  [kvm]                         [k] decode_register_operand
     0.10%  qemu-kvm  [kernel.kallsyms]             [k] __list_add
     0.10%  qemu-kvm  [kvm]                         [k] apic_enabled
     0.10%  qemu-kvm  [kvm_intel]                   [k] vmx_cache_reg
     0.10%  qemu-kvm  [kvm]                         [k] kvm_apic_has_interrupt
     0.10%  qemu-kvm  [kvm]                         [k] trace_kvm_mmu_pagetable_walk
     0.10%  qemu-kvm  [kernel.kallsyms]             [k] slab_pad_check.part.7
     0.09%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_disable
     0.09%  qemu-kvm  [kvm]                         [k] write_mmio
     0.09%  qemu-kvm  [kvm_intel]                   [k] vmx_decache_cr3
     0.09%  qemu-kvm  [kernel.kallsyms]             [k] rb_erase
     0.09%  qemu-kvm  [ip_tables]                   [k] ipt_do_table
     0.09%  qemu-kvm  [kvm_intel]                   [k] vmx_get_cpl
     0.09%  qemu-kvm  [kvm]                         [k] get_cr3
     0.09%  qemu-kvm  [kvm]                         [k] walk_shadow_page_get_mmio_spte
     0.09%  qemu-kvm  [kernel.kallsyms]             [k] update_min_vruntime
     0.09%  qemu-kvm  [kvm_intel]                   [k] vmx_read_guest_seg_ar
     0.09%  qemu-kvm  [kvm]                         [k] translate_gpa
     0.09%  qemu-kvm  libpthread-2.14.so            [.] __lll_lock_wait
     0.09%  qemu-kvm  [kvm_intel]                   [k] cpu_has_virtual_nmis
     0.09%  qemu-kvm  [kernel.kallsyms]             [k] sys_futex
     0.09%  qemu-kvm  [kernel.kallsyms]             [k] dput
     0.08%  qemu-kvm  [kvm]                         [k] kvm_arch_vm_ioctl
     0.08%  qemu-kvm  [kvm]                         [k] srcu_read_lock
     0.08%  qemu-kvm  [kvm]                         [k] kvm_fetch_guest_virt
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] device_not_available
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] put_prev_task_fair
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] get_futex_key
     0.08%  qemu-kvm  [kernel.kallsyms].vsyscall_0  [.] 0xffffffffff60010c
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] task_group
     0.08%  qemu-kvm  [kvm]                         [k] kvm_vm_ioctl
     0.08%  qemu-kvm  [kvm]                         [k] kvm_lapic_sync_to_vapic
     0.08%  qemu-kvm  [kvm_intel]                   [k] vmx_get_interrupt_shadow
     0.08%  qemu-kvm  [kvm]                         [k] kvm_on_user_return
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] delay_tsc
     0.08%  qemu-kvm  [kvm]                         [k] paging64_gpte_to_gfn_lvl
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] init_object
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] fire_user_return_notifiers
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] current_kernel_time
     0.08%  qemu-kvm  [kvm]                         [k] check_write_user_access
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] clear_ti_thread_flag.constprop.15
     0.08%  qemu-kvm  [vdso]                        [.] 0x7fff8299f9be  
     0.08%  qemu-kvm  [kvm]                         [k] kvm_get_msr
     0.08%  qemu-kvm  [kvm_intel]                   [k] vmx_read_guest_seg_selector
     0.08%  qemu-kvm  [kvm]                         [k] kvm_io_bus_write
     0.08%  qemu-kvm  [kvm]                         [k] fetch_register_operand
     0.08%  qemu-kvm  [kvm_intel]                   [k] vmx_set_rflags
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] fsnotify
     0.08%  qemu-kvm  [kvm]                         [k] update_cr8_intercept
     0.08%  qemu-kvm  [kvm]                         [k] coalesced_mmio_write
     0.08%  qemu-kvm  [kernel.kallsyms]             [k] native_load_gdt
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] __xapic_wait_icr_idle
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] idle_cpu
     0.07%  qemu-kvm  [kvm]                         [k] kvm_check_async_pf_completion
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] do_futex
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] source_load
     0.07%  qemu-kvm  [kvm]                         [k] emulator_write_emulated
     0.07%  qemu-kvm  [tun]                         [k] tun_chr_poll
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] recalc_sigpending_tsk
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] irq_entries_start
     0.07%  qemu-kvm  [kvm]                         [k] apic_reg_write
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] set_current_blocked
     0.07%  qemu-kvm  libpthread-2.14.so            [.] __read
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_unlock
     0.07%  qemu-kvm  [kvm]                         [k] handle_mmio_page_fault_common
     0.07%  qemu-kvm  [kvm_intel]                   [k] vmx_get_rflags
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] sigprocmask
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] sock_poll
     0.07%  qemu-kvm  [kvm]                         [k] kvm_make_request
     0.07%  qemu-kvm  [kernel.kallsyms]             [k] __srcu_read_unlock
     0.06%  qemu-kvm  [kvm]                         [k] kvm_irq_line_state
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] __pollwait
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] do_notify_resume
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] __mutex_unlock_slowpath
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] get_partial_node
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] futex_wait
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] futex_wake
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] rcu_sched_qs
     0.06%  qemu-kvm  librt-2.14.so                 [.] __GI_clock_gettime
     0.06%  qemu-kvm  [kvm]                         [k] apic_find_highest_irr
     0.06%  qemu-kvm  [kvm]                         [k] vcpu_mmio_read
     0.06%  qemu-kvm  [kvm]                         [k] kvm_read_guest_page
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_unlock_irq
     0.06%  qemu-kvm  [kvm]                         [k] em_mov
     0.06%  qemu-kvm  [kvm]                         [k] trace_kvm_mmio
     0.06%  qemu-kvm  [kvm]                         [k] kvm_read_guest
     0.06%  qemu-kvm  [kvm_intel]                   [k] vmx_interrupt_allowed
     0.06%  qemu-kvm  [kvm]                         [k] read_prepare
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] zero_fd_set
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] ttwu_stat
     0.06%  qemu-kvm  [kernel.kallsyms]             [k] user_return_notifier_unregister
     0.05%  qemu-kvm  [kvm]                         [k] kvm_get_rflags
     0.05%  qemu-kvm  [kvm]                         [k] trace_kvm_mmu_paging_element
     0.05%  qemu-kvm  [kvm]                         [k] kvm_sched_in
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] check_valid_pointer
     0.05%  qemu-kvm  libc-2.14.so                  [.] __libc_select
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] core_sys_select
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] clear_buddies
     0.05%  qemu-kvm  [kvm]                         [k] segmented_write
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] mntput
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] tcp_v4_rcv
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] test_tsk_thread_flag
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] rw_verify_area
     0.05%  qemu-kvm  [tun]                         [k] tun_get_user
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] unix_poll
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] __slab_alloc
     0.05%  qemu-kvm  [kvm]                         [k] srcu_read_unlock
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] account_entity_enqueue
     0.05%  qemu-kvm  [kvm]                         [k] kvm_lapic_sync_from_vapic
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] signalfd_poll
     0.05%  qemu-kvm  [kvm]                         [k] kvm_lapic_get_cr8
     0.05%  qemu-kvm  [kvm]                         [k] vcpu_load
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] hash_futex
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] do_raw_spin_lock
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] native_smp_send_reschedule
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] audit_free_aux
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] find_busiest_group
     0.05%  qemu-kvm  [kvm_intel]                   [k] paravirt_read_msr
     0.05%  qemu-kvm  [kvm]                         [k] emulator_pio_in_emulated
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] poll_select_copy_remaining
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] lock_hrtimer_base
     0.05%  qemu-kvm  [kernel.kallsyms]             [k] select_estimate_accuracy
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_save
     0.04%  qemu-kvm  [kvm]                         [k] atomic_inc
     0.04%  qemu-kvm  [bridge]                      [k] br_handle_frame
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] recalc_sigpending
     0.04%  qemu-kvm  libpthread-2.14.so            [.] __lll_unlock_wake
     0.04%  qemu-kvm  [kvm_intel]                   [k] vmx_vcpu_put
     0.04%  qemu-kvm  [kvm]                         [k] kvm_read_cr4_bits
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] __netif_receive_skb
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] paravirt_read_tsc
     0.04%  qemu-kvm  [kvm]                         [k] apic_mmio_write
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] __math_state_restore
     0.04%  qemu-kvm  [kvm_intel]                   [k] kvm_read_cr0_bits
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] task_ca
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] signalfd_read
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] pick_next_task
     0.04%  qemu-kvm  [kvm]                         [k] kvm_set_rflags
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] check_slab
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] dequeue_task
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] sys_read
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] sys_select
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_save
     0.04%  qemu-kvm  [kvm_intel]                   [k] emulate_instruction
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] debug_mutex_add_waiter
     0.04%  qemu-kvm  [kvm]                         [k] kvm_apic_present
     0.04%  qemu-kvm  [kvm]                         [k] kvm_vcpu_block
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] plist_check_list
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] hrtick_update
     0.04%  qemu-kvm  [kvm]                         [k] arch_local_irq_disable
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] auditsys
     0.04%  qemu-kvm  [kvm]                         [k] vcpu_put
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] ip_rcv
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_enable
     0.04%  qemu-kvm  [kvm]                         [k] kvm_ioapic_set_irq
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] vfs_read
     0.04%  qemu-kvm  [tun]                         [k] __tun_get
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] timerqueue_del
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] __do_softirq
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] mutex_unlock
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] on_freelist
     0.04%  qemu-kvm  [xt_tcpudp]                   [k] tcp_mt
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] mutex_lock_nested
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] __dequeue_signal
     0.04%  qemu-kvm  [kvm]                         [k] kvm_put_guest_fpu
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] debug_mutex_lock_common
     0.04%  qemu-kvm  [kernel.kallsyms]             [k] plist_check_head
     0.04%  qemu-kvm  [kvm]                         [k] kvm_lapic_enabled
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] kmem_cache_alloc_node
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_try_to_cancel
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __rcu_pending
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __wake_up_common
     0.03%  qemu-kvm  libc-2.14.so                  [.] __libc_writev
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] ip_route_input_common
     0.03%  qemu-kvm  libc-2.14.so                  [.] __libc_disable_asynccancel
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __inet_lookup_established
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] common_interrupt
     0.03%  qemu-kvm  [kvm_intel]                   [k] vmx_set_interrupt_shadow
     0.03%  qemu-kvm  [kvm]                         [k] trace_vcpu_match_mmio
     0.03%  qemu-kvm  [kvm_intel]                   [k] init_module
     0.03%  qemu-kvm  [kvm_intel]                   [k] update_cr8_intercept
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] read_tsc
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] debug_mutex_free_waiter
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] eventfd_poll
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] _flat_send_IPI_mask
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] task_waking_fair
     0.03%  qemu-kvm  [bridge]                      [k] br_nf_pre_routing
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] user_return_notifier_register
     0.03%  qemu-kvm  [kvm]                         [k] decode_register
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] finish_task_switch
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] pick_next_task_stop
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __percpu_counter_add
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] rb_insert_color
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_init
     0.03%  qemu-kvm  [kvm]                         [k] trace_kvm_emulate_insn
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] get_rps_cpu
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] atomic_dec_and_test
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] set_normalized_timespec
     0.03%  qemu-kvm  [tun]                         [k] tun_do_read
     0.03%  qemu-kvm  libpthread-2.14.so            [.] __errno_location
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] schedule_hrtimeout_range_clock
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] poll_freewait
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] _local_bh_enable_ip
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] preempt_notifier_unregister
     0.03%  qemu-kvm  [kvm]                         [k] rcu_read_lock
     0.03%  qemu-kvm  [kvm]                         [k] shadow_walk_okay
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_unlock
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] int_with_check
     0.03%  qemu-kvm  [kvm]                         [k] apic_reg_read
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] check_object
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __remove_hrtimer
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] next_signal
     0.03%  qemu-kvm  [nf_conntrack]                [k] nf_conntrack_in
     0.03%  qemu-kvm  [kvm]                         [k] check_perm_in
     0.03%  qemu-kvm  [kvm]                         [k] kvm_arch_vcpu_runnable
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] flat_send_IPI_mask
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] radix_tree_lookup_element
     0.03%  qemu-kvm  [kvm]                         [k] copy_from_user
     0.03%  qemu-kvm  [kvm]                         [k] kvm_irq_delivery_to_apic
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] list_del
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] ktime_get_ts
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __alloc_skb
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] copy_from_user
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] x86_pmu_disable
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] timekeeping_get_ns
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] task_of
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] lock_contended
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] atomic_inc
     0.03%  qemu-kvm  [bridge]                      [k] br_nf_pre_routing_finish
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] fsnotify_access
     0.03%  qemu-kvm  [kvm]                         [k] arch_local_irq_enable
     0.03%  qemu-kvm  [kvm]                         [k] kvm_ioapic_update_eoi
     0.03%  qemu-kvm  [kvm]                         [k] kvm_get_cr8
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] sysret_audit
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] enqueue_hrtimer
     0.03%  qemu-kvm  [kvm]                         [k] write_register_operand
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     0.03%  qemu-kvm  [kvm_intel]                   [k] paravirt_write_msr
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __local_bh_disable
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_interrupt
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __lock_timer
     0.03%  qemu-kvm  [kvm]                         [k] kvm_read_cr0_bits
     0.03%  qemu-kvm  libpthread-2.14.so            [.] __pthread_enable_asynccancel
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_cancel
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __kmalloc_node_track_caller
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     0.03%  qemu-kvm  libpthread-2.14.so            [.] __libc_sigaction
     0.03%  qemu-kvm  [kernel.kallsyms]             [k] __fsnotify_parent
     0.03%  qemu-kvm  [kvm]                         [k] walk_shadow_page_lockless_begin
     0.03%  qemu-kvm  [kvm_intel]                   [k] handle_apic_access
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] __phys_addr
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] handle_edge_irq
     0.02%  qemu-kvm  [bridge]                      [k] NF_HOOK_THRESH
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] update_rq_clock
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] add_wait_queue
     0.02%  qemu-kvm  [iptable_nat]                 [k] nf_nat_fn
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] select_task_rq
     0.02%  qemu-kvm  libc-2.14.so                  [.] _int_free
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] math_state_restore
     0.02%  qemu-kvm  [nf_conntrack]                [k] tcp_error
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] signal_pending
     0.02%  qemu-kvm  [bridge]                      [k] br_handle_frame_finish
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] vfs_writev
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] do_IRQ
     0.02%  qemu-kvm  [kvm]                         [k] kvm_lapic_find_highest_irr
     0.02%  qemu-kvm  [tun]                         [k] atomic_dec_and_test
     0.02%  qemu-kvm  [tun]                         [k] tun_sock_write_space
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] debug_mutex_unlock
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] scheduler_tick
     0.02%  qemu-kvm  librt-2.14.so                 [.] __timer_gettime_new
     0.02%  qemu-kvm  [kvm]                         [k] apic_has_pending_timer
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] common_timer_get
     0.02%  qemu-kvm  libc-2.14.so                  [.] __gettimeofday_internal
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] do_softirq
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] mutex_remove_waiter
     0.02%  qemu-kvm  [kvm]                         [k] emulator_read_emulated
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] call_softirq
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] process_backlog
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] ip_local_deliver_finish
     0.02%  qemu-kvm  [kvm]                         [k] __apic_accept_irq
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] __mod_timer
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] futex_wait_queue_me
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] perf_pmu_disable
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] account_group_exec_runtime
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] remove_wait_queue
     0.02%  qemu-kvm  [kvm]                         [k] pic_set_irq1
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] run_timer_softirq
     0.02%  qemu-kvm  [kvm]                         [k] ioapic_mmio_read
     0.02%  qemu-kvm  [kvm]                         [k] kvm_arch_vcpu_put
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] set_fd_set
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] preempt_notifier_register
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] load_balance
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] net_rx_action
     0.02%  qemu-kvm  [bridge]                      [k] compare_ether_addr
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] ip_rcv_finish
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] ktime_add_safe
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] __set_task_blocked
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] sock_poll_wait
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] prepare_to_wait
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] futex_wait_setup
     0.02%  qemu-kvm  [iptable_mangle]              [k] iptable_mangle_hook
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] sock_alloc_send_pskb
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] pick_next_task_rt
     0.02%  qemu-kvm  libpthread-2.14.so            [.] sigaction
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] _copy_from_user
     0.02%  qemu-kvm  [kvm]                         [k] kvm_notify_acked_irq
     0.02%  qemu-kvm  [kvm]                         [k] apic_set_vector
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] handle_irq_event_percpu
     0.02%  qemu-kvm  [kvm]                         [k] start_apic_timer
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] sys_timer_gettime
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] __clear_user
     0.02%  qemu-kvm  [nf_conntrack]                [k] ____nf_conntrack_find
     0.02%  qemu-kvm  [kvm]                         [k] is_shadow_present_pte
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] idr_find
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_disable
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] tg_load_down
     0.02%  qemu-kvm  libc-2.14.so                  [.] __libc_enable_asynccancel
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] trace_softirqs_off
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] update_shares
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] native_apic_mem_write
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] task_tick_fair
     0.02%  qemu-kvm  [kvm]                         [k] write_exit_mmio
     0.02%  qemu-kvm  [kvm]                         [k] trace_kvm_apic
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] rcu_note_context_switch
     0.02%  qemu-kvm  [kvm]                         [k] kvm_pic_set_irq
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] common_timer_set
     0.02%  qemu-kvm  [bridge]                      [k] br_port_get_rcu
     0.02%  qemu-kvm  [kvm]                         [k] ioapic_mmio_write
     0.02%  qemu-kvm  [kvm_intel]                   [k] vmx_inject_irq
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.02%  qemu-kvm  [kvm]                         [k] pic_update_irq
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] cpumask_next_and
     0.02%  qemu-kvm  [kvm]                         [k] kvm_apic_match_logical_addr
     0.02%  qemu-kvm  [kvm]                         [k] shadow_walk_init
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] do_sigaction
     0.02%  qemu-kvm  [nf_nat]                      [k] nf_nat_packet
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] update_cpu_load
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] deactivate_task
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] pick_next_task_fair
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] poll_schedule_timeout
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] clockevents_program_event
     0.02%  qemu-kvm  [kvm]                         [k] pic_unlock
     0.02%  qemu-kvm  [kvm]                         [k] kvm_arch_interrupt_allowed
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] deactivate_slab
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] kmem_cache_alloc_trace
     0.02%  qemu-kvm  [nf_conntrack]                [k] __nf_ct_l3proto_find
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] nf_hook_slow
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] slab_pre_alloc_hook
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] sys_timer_settime
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] hrtick_start_fair
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] intel_pmu_disable_all
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] cpu_avg_load_per_task
     0.02%  qemu-kvm  [kvm]                         [k] pit_ioport_read
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] dequeue_signal
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] account_system_time
     0.02%  qemu-kvm  [kvm]                         [k] ioapic_service
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] enqueue_to_backlog
     0.02%  qemu-kvm  [kernel.kallsyms]             [k] fsnotify_modify
     0.01%  qemu-kvm  libc-2.14.so                  [.] _int_malloc
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sys_writev
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] plist_add
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] finish_wait
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] test_tsk_thread_flag
     0.01%  qemu-kvm  libpthread-2.14.so            [.] __pthread_disable_asynccancel
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sock_update_classid
     0.01%  qemu-kvm  [tun]                         [k] netif_wake_queue
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] eth_type_trans
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_dst
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_dequeue
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_pull_rcsum
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] do_readv_writev
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.01%  qemu-kvm  [tun]                         [k] tun_chr_aio_write
     0.01%  qemu-kvm  [kvm_intel]                   [k] is_paging
     0.01%  qemu-kvm  [tun]                         [k] tun_chr_aio_read
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] trace_softirqs_on
     0.01%  qemu-kvm  [nf_conntrack]                [k] __nf_ct_l4proto_find
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] dst_metric
     0.01%  qemu-kvm  librt-2.14.so                 [.] __timer_settime_new
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] perf_event_task_tick
     0.01%  qemu-kvm  qemu-kvm                      [.] ioctl@plt
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] put_task_struct
     0.01%  qemu-kvm  [kvm]                         [k] apic_mmio_read
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] timespec_add_safe
     0.01%  qemu-kvm  [kvm]                         [k] kvm_rip_read
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rb_next
     0.01%  qemu-kvm  libz.so.1.2.5                 [.] 0x29e0          
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] signal_pending_state
     0.01%  qemu-kvm  libpthread-2.14.so            [.] _L_lock_758
     0.01%  qemu-kvm  [bridge]                      [k] br_parse_ip_options
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] alloc_debug_processing
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] find_next_bit
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] netif_rx
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] add_partial
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_put
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] cpuacct_update_stats
     0.01%  qemu-kvm  [kvm]                         [k] emulator_bad_iopl
     0.01%  qemu-kvm  [bridge]                      [k] __br_fdb_get
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] irq_exit
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] perf_event_task_sched_out
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_copy_datagram_from_iovec
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] set_track
     0.01%  qemu-kvm  [kvm_intel]                   [k] skip_emulated_instruction
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_release_head_state
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] irq_enter
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __slab_free
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] get_futex_key_refs
     0.01%  qemu-kvm  [kvm]                         [k] speaker_ioport_read
     0.01%  qemu-kvm  [iptable_filter]              [k] iptable_filter_hook
     0.01%  qemu-kvm  [bridge]                      [k] br_fdb_update
     0.01%  qemu-kvm  [bridge]                      [k] nf_bridge_encap_header_len
     0.01%  qemu-kvm  libc-2.14.so                  [.] __malloc
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] irq_complete_move
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_start
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] int_restore_rest
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_bh_qs
     0.01%  qemu-kvm  [kvm]                         [k] kvm_make_request
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_exit_nohz
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] get_parent_ip
     0.01%  qemu-kvm  [nf_conntrack_ipv4]           [k] ipv4_confirm
     0.01%  qemu-kvm  [iptable_nat]                 [k] nf_nat_in
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] x86_pmu_enable
     0.01%  qemu-kvm  [nf_conntrack]                [k] segment_seq_plus_len
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] hpet_msi_next_event
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_enter_nohz
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] profile_tick
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_disable
     0.01%  qemu-kvm  [kvm]                         [k] kvm_vcpu_kick
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] handle_irq_event
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] atomic_dec_and_test
     0.01%  qemu-kvm  [bridge]                      [k] pskb_may_pull
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_check_callbacks
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] pick_next_task_idle
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] paravirt_write_msr
     0.01%  qemu-kvm  [kvm_intel]                   [k] handle_io
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] inet_csk_reset_xmit_timer.constprop.14
     0.01%  qemu-kvm  [nf_conntrack]                [k] hash_conntrack_raw
     0.01%  qemu-kvm  libc-2.14.so                  [.] malloc_consolidate.part.3
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] atomic_dec_and_test
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] ip_local_deliver
     0.01%  qemu-kvm  [bridge]                      [k] br_port_get_rcu
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] update_process_times
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] int_signal
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] int_check_syscall_exit_work
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] do_sync_readv_writev
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sys_rt_sigaction
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] atomic_add
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] tick_program_event
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __run_hrtimer
     0.01%  qemu-kvm  [kvm_intel]                   [k] handle_halt
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] csum_partial
     0.01%  qemu-kvm  [nf_conntrack]                [k] atomic_add_unless.constprop.5
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] netif_receive_skb
     0.01%  qemu-kvm  libpthread-2.14.so            [.] _L_unlock_686
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] radix_tree_lookup
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] eventfd_write
     0.01%  qemu-kvm  [kvm_intel]                   [k] __invept.constprop.48
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] tick_sched_timer
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] run_local_timers
     0.01%  qemu-kvm  [nf_conntrack]                [k] __nf_conntrack_find_get
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sk_sleep
     0.01%  qemu-kvm  [nf_conntrack_ipv4]           [k] ipv4_get_l4proto
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] do_device_not_available
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] schedule_hrtimeout_range
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_dst_set_noref
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] do_sync_read
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] vfs_write
     0.01%  qemu-kvm  [kvm_intel]                   [k] kvm_register_read
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] tick_dev_program_event
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] _copy_to_user
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __bit_spin_unlock.constprop.21
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] drop_futex_key_refs
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] get_fd_set
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] set_next_buddy
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] hpet_interrupt_handler
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] lock_point
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] NF_HOOK.constprop.2
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __rb_rotate_left
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __rb_rotate_right
     0.01%  qemu-kvm  [kvm]                         [k] kvm_set_pic_irq
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] fput_light
     0.01%  qemu-kvm  [bridge]                      [k] br_mac_hash
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] nf_ip_checksum
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] debug_check_no_locks_freed
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_unlock
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] atomic_dec_and_test
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] save_args
     0.01%  qemu-kvm  [kvm_intel]                   [k] __invvpid.constprop.47
     0.01%  qemu-kvm  qemu-kvm                      [.] pthread_mutex_unlock@plt
     0.01%  qemu-kvm  [nf_nat]                      [k] nf_nat_get_offset
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_run_queues
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_lock_nested
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_push
     0.01%  qemu-kvm  [bridge]                      [k] nf_bridge_put
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] debug_deactivate
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] memcpy_fromiovecend
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] default_wake_function
     0.01%  qemu-kvm  [nf_conntrack_ipv4]           [k] ipv4_conntrack_in
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] srcu_read_lock_held.constprop.2
     0.01%  qemu-kvm  libpthread-2.14.so            [.] __libc_write
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] eventfd_ctx_read
     0.01%  qemu-kvm  [kvm]                         [k] kvm_apic_match_dest
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] slab_free_hook
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_copy_datagram_const_iovec
     0.01%  qemu-kvm  [bridge]                      [k] setup_pre_routing
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_forward
     0.01%  qemu-kvm  [bridge]                      [k] rcu_read_lock_held
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sock_flag
     0.01%  qemu-kvm  [kvm]                         [k] rcu_read_unlock
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] exit_idle
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_irq_enter
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] do_timer
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] lockdep_trace_alloc
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] skb_release_data
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] ctx_sched_out
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_start_range_ns
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] detach_timer
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] printk_tick
     0.01%  qemu-kvm  [bridge]                      [k] ip_sabotage_in
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] account_process_tick
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_save
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_unlock
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] mod_timer
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sk_filter
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] clockid_to_kclock
     0.01%  qemu-kvm  [nf_conntrack]                [k] __nf_ct_refresh_acct
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] compound_order
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] error_exit
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __enqueue_entity
     0.01%  qemu-kvm  [bridge]                      [k] has_expired
     0.01%  qemu-kvm  [kvm]                         [k] picdev_read
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] error_entry
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __delay
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] inet_ehashfn
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] acct_update_integrals
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] free_debug_processing
     0.01%  qemu-kvm  [nf_conntrack]                [k] skb_header_pointer
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.01%  qemu-kvm  [kvm]                         [k] kvm_cpu_has_pending_timer
     0.01%  qemu-kvm  [kvm]                         [k] kernel_pio
     0.01%  qemu-kvm  [kvm]                         [k] kvm_emulate_halt
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] ack_apic_edge
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] irq_move_irq
     0.01%  qemu-kvm  [nf_nat]                      [k] nfct_nat
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_enable
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] irq_to_desc
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] atomic_add_unless.constprop.2
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] irq_work_run
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] pagefault_enable
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.01%  qemu-kvm  [xt_tcpudp]                   [k] skb_header_pointer
     0.01%  qemu-kvm  qemu-kvm                      [.] write@plt
     0.01%  qemu-kvm  [kvm]                         [k] kvm_make_request
     0.01%  qemu-kvm  [nf_conntrack]                [k] nf_ct_get_tuple
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sock_wfree
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sk_reset_timer
     0.01%  qemu-kvm  qemu-kvm                      [.] gettimeofday@plt
     0.01%  qemu-kvm  [bridge]                      [k] vlan_proto
     0.01%  qemu-kvm  [kvm]                         [k] kvm_get_apic_base
     0.01%  qemu-kvm  [ip_tables]                   [k] xt_write_recseq_begin
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] sys_write
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] pskb_may_pull
     0.01%  qemu-kvm  qemu-kvm                      [.] read@plt
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] intel_pmu_nhm_enable_all
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] raise_softirq
     0.01%  qemu-kvm  [nf_conntrack_ipv4]           [k] ipv4_pkt_to_tuple
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] tracesys
     0.01%  qemu-kvm  qemu-kvm                      [.] memcpy@plt
     0.01%  qemu-kvm  [nf_conntrack]                [k] nf_ct_tuple_equal
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] lock_timer_base
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] get_futex_value_locked
     0.01%  qemu-kvm  [kvm]                         [k] emulator_get_cpl
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] prefetchw
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] autoremove_wake_function
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] ns_to_timespec
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __sk_dst_get
     0.01%  qemu-kvm  [nf_defrag_ipv4]              [k] nf_defrag_fini
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_unlock
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] note_interrupt
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] send_sigqueue
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __perf_event_task_sched_in
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_unlock
     0.01%  qemu-kvm  [nf_defrag_ipv4]              [k] ipv4_conntrack_defrag
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rt_hash
     0.01%  qemu-kvm  [nf_conntrack]                [k] tcp_pkt_to_tuple
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] rw_copy_check_uvector
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] set_task_rq
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] wake_up_state
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] __raise_softirq_irqoff
     0.01%  qemu-kvm  [kernel.kallsyms]             [k] set_task_cpu
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] skb_dst
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] bit_spin_lock.constprop.22
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] raw_local_deliver
     0.00%  qemu-kvm  [kvm]                         [k] rcu_read_lock_held
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __kfree_skb
     0.00%  qemu-kvm  libpthread-2.14.so            [.] 0xdf69          
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] task_nice
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __raise_softirq_irqoff.constprop.37
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] insert_work
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_force_reprogram
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] jiffies_to_timeval
     0.00%  qemu-kvm  [kvm]                         [k] walk_shadow_page_lockless_end
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] trace_kmalloc_node
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __local_bh_enable
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] poll_initwait
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_lock_bh
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] debug_activate
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] net_rps_action_and_irq_enable
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] get_slab
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] enqueue_task_fair
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] local_bh_disable
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] wants_signal
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] handle_irq
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __sigqueue_free
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_restore
     0.00%  qemu-kvm  [kvm]                         [k] kvm_get_apic_interrupt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] enqueue_task
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] call_function_single_interrupt
     0.00%  qemu-kvm  [kvm]                         [k] __shadow_walk_next
     0.00%  qemu-kvm  [bridge]                      [k] rcu_read_lock_held
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] sched_clock_tick
     0.00%  qemu-kvm  [kvm_intel]                   [k] segment_base
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] run_posix_cpu_timers
     0.00%  qemu-kvm  [kvm]                         [k] lapic_is_periodic
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] memcpy_toiovecend
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] poll_select_set_timeout
     0.00%  qemu-kvm  [kvm_intel]                   [k] vmx_flush_tlb
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __wake_up
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] pollwake
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] ttwu_do_wakeup
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] monotonic_to_bootbased
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __skb_checksum_complete_head
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] tick_do_update_jiffies64
     0.00%  qemu-kvm  [bridge]                      [k] bridge_parent_rtable
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] next_thread
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] internal_add_timer
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] atomic_cmpxchg
     0.00%  qemu-kvm  [bridge]                      [k] nf_bridge_alloc
     0.00%  qemu-kvm  [kvm]                         [k] read_exit_mmio
     0.00%  qemu-kvm  qemu-kvm                      [.] clock_gettime@plt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] match_futex
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] native_send_call_func_single_ipi
     0.00%  qemu-kvm  [bridge]                      [k] br_nf_local_in
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] apic_write
     0.00%  qemu-kvm  [kvm_intel]                   [k] vmx_clear_hlt
     0.00%  qemu-kvm  [bridge]                      [k] nf_bridge_update_protocol
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __queue_work
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] find_first_bit
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] activate_task
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] plist_del
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] _raw_spin_unlock_bh
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] target_load
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] copy_from_user
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __lock_task_sighand
     0.00%  qemu-kvm  [kvm]                         [k] __kvm_migrate_pit_timer
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] ack_APIC_irq
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     0.00%  qemu-kvm  [kvm]                         [k] kvm_timer_fn
     0.00%  qemu-kvm  libc-2.14.so                  [.] __memset_sse2
     0.00%  qemu-kvm  [kvm]                         [k] pic_get_irq
     0.00%  qemu-kvm  [kvm_intel]                   [k] handle_exception
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] pagefault_disable
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] smp_call_function_single_interrupt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_start_expires
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] check_preempt_curr
     0.00%  qemu-kvm  [kvm]                         [k] __kvm_migrate_apic_timer
     0.00%  qemu-kvm  qemu-kvm                      [.] select@plt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] native_flush_tlb_global
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] atomic_sub
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] resched_task
     0.00%  qemu-kvm  [bridge]                      [k] NF_HOOK.constprop.1
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] csd_lock
     0.00%  qemu-kvm  qemu-kvm                      [.] timer_gettime@plt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __enqueue_rt_entity
     0.00%  qemu-kvm  [kvm]                         [k] kvm_cpu_get_interrupt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] smp_call_function_single
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] invoke_rcu_cpu_kthread
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] check_tsc_unstable
     0.00%  qemu-kvm  [kvm]                         [k] apic_clear_vector
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] netif_rx_ni
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] reschedule_interrupt
     0.00%  qemu-kvm  libc-2.14.so                  [.] calloc
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] eventfd_read
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] intel_pmu_enable_all
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] virt_to_head_page
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] check_preempt_wakeup
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] prepare_signal
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] local_bh_enable_ip
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] wake_futex
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] sock_recvmsg
     0.00%  qemu-kvm  [kvm]                         [k] kvm_set_ioapic_irq
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] perf_pmu_enable
     0.00%  qemu-kvm  [bridge]                      [k] nf_bridge_pull_encap_header_rcsum
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] trace_workqueue_activate_work
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] enqueue_task_rt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] perf_event_context_sched_in
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] wakeup_preempt_entity
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_irq_exit
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] kthread_data
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] test_tsk_need_resched
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __put_user_8
     0.00%  qemu-kvm  [kvm]                         [k] kvm_apic_compare_prio
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] kfree_skb
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] generic_smp_call_function_single_interrupt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] timespec_add
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_enable
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] wake_up_worker
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __unqueue_futex
     0.00%  qemu-kvm  [kvm_intel]                   [k] vmx_cancel_injection
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] skb_checksum
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __skb_dequeue
     0.00%  qemu-kvm  libc-2.14.so                  [.] __strcmp_sse42
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] mix_pool_bytes_extract
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] place_entity
     0.00%  qemu-kvm  libc-2.14.so                  [.] __cfree
     0.00%  qemu-kvm  [kvm]                         [k] trace_kvm_pio
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.00%  qemu-kvm  [kvm_intel]                   [k] loaded_vmcs_init
     0.00%  qemu-kvm  [kvm]                         [k] arch_local_save_flags
     0.00%  qemu-kvm  libc-2.14.so                  [.] _itoa_word
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] posix_timer_fn
     0.00%  qemu-kvm  [kvm_intel]                   [k] vpid_sync_context
     0.00%  qemu-kvm  [ip_tables]                   [k] xt_write_recseq_end
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] local_bh_enable
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] group_first_cpu
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] format_decode
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] raise_softirq_irqoff
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_run_pending
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __inet_lookup.constprop.13
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] hid_report_raw_event
     0.00%  qemu-kvm  [kvm_intel]                   [k] handle_interrupt_window
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_save_flags
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_unlock
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __cond_resched
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] set_next_entity
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] arch_local_irq_disable
     0.00%  qemu-kvm  [kvm_intel]                   [k] __loaded_vmcs_clear
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __wake_up_sync_key
     0.00%  qemu-kvm  [kvm_intel]                   [k] handle_external_interrupt
     0.00%  qemu-kvm  [nf_conntrack]                [k] rcu_read_lock
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] getnstimeofday
     0.00%  qemu-kvm  [kvm]                         [k] kvm_apic_set_irq
     0.00%  qemu-kvm  qemu-kvm                      [.] pthread_mutex_lock@plt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] copy_from_user
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] in_lock_functions
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] delayed_work_timer_fn
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] sock_put
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __task_rq_lock
     0.00%  qemu-kvm  [tun]                         [k] tun_put
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] set_tsk_need_resched
     0.00%  qemu-kvm  qemu-kvm                      [.] timer_settime@plt
     0.00%  qemu-kvm  libc-2.14.so                  [.] __GI_vfprintf
     0.00%  qemu-kvm  [kvm]                         [k] mark_page_dirty
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] __skb_checksum_complete
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] string
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] complete_signal
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] cpumask_copy
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rt_task_of
     0.00%  qemu-kvm  qemu-kvm                      [.] __errno_location@plt
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] update_stats_wait_end
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_unlock
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] cgroup_lock_is_held
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] verify_iovec
     0.00%  qemu-kvm  libpthread-2.14.so            [.] __recvmsg
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] unix_stream_recvmsg
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] cpumask_next
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] scm_destroy
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_unlock
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rt_se_prio
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] atomic_inc
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] rcu_read_lock_held
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] get_work_cwq
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] hrtimer_start_expires
     0.00%  qemu-kvm  [kernel.kallsyms]             [k] msecs_to_jiffies


#
# (For a higher level overview, try: perf report --sort comm,dso)
#

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 01/22] KVM: MMU: fix walking shadow page table
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
@ 2011-06-22 14:28 ` Xiao Guangrong
  2011-06-22 17:13   ` Marcelo Tosatti
  2011-06-22 14:28 ` [PATCH v2 02/22] KVM: MMU: do not update slot bitmap if spte is nonpresent Xiao Guangrong
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Properly check for the last mapping, and do not walk down to the next level if
the last spte has been reached
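
For illustration, a stand-alone sketch of the intended walker behaviour (the
levels, the leaf test and all names below are simplified stand-ins, not the
real iterator):

#include <stdio.h>
#include <stdbool.h>

struct walker {
        int level;      /* current level, 1 is the page table level */
        bool leaf;      /* did the current entry map the page directly? */
};

static bool walk_okay(struct walker *w)
{
        return w->level >= 1;   /* stop once we have run past the last level */
}

static void walk_next(struct walker *w)
{
        /* a leaf (last) entry ends the walk even above level 1,
         * e.g. for large pages */
        if (w->leaf) {
                w->level = 0;
                return;
        }
        --w->level;
}

int main(void)
{
        struct walker w = { .level = 3, .leaf = false };

        while (walk_okay(&w)) {
                printf("visiting level %d\n", w.level);
                w.leaf = (w.level == 2);  /* pretend level 2 is a large-page leaf */
                walk_next(&w);
        }
        return 0;
}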

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9c629b5..f474e93 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
 	if (iterator->level < PT_PAGE_TABLE_LEVEL)
 		return false;
 
-	if (iterator->level == PT_PAGE_TABLE_LEVEL)
-		if (is_large_pte(*iterator->sptep))
-			return false;
-
 	iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
 	iterator->sptep	= ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
 	return true;
@@ -1528,6 +1524,11 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
 
 static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 {
+	if (is_last_spte(*iterator->sptep, iterator->level)) {
+		iterator->level = 0;
+		return;
+	}
+
 	iterator->shadow_addr = *iterator->sptep & PT64_BASE_ADDR_MASK;
 	--iterator->level;
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 02/22] KVM: MMU: do not update slot bitmap if spte is nonpresent
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
  2011-06-22 14:28 ` [PATCH v2 01/22] KVM: MMU: fix walking shadow page table Xiao Guangrong
@ 2011-06-22 14:28 ` Xiao Guangrong
  2011-06-22 14:29 ` [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary Xiao Guangrong
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Set the slot bitmap only if the spte is present

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f474e93..8316c2d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -743,9 +743,6 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 	struct kvm_mmu_page *sp;
 	unsigned long *rmapp;
 
-	if (!is_rmap_spte(*spte))
-		return 0;
-
 	sp = page_header(__pa(spte));
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
 	rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
@@ -2087,11 +2084,13 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (!was_rmapped && is_large_pte(*sptep))
 		++vcpu->kvm->stat.lpages;
 
-	page_header_update_slot(vcpu->kvm, sptep, gfn);
-	if (!was_rmapped) {
-		rmap_count = rmap_add(vcpu, sptep, gfn);
-		if (rmap_count > RMAP_RECYCLE_THRESHOLD)
-			rmap_recycle(vcpu, sptep, gfn);
+	if (is_shadow_present_pte(*sptep)) {
+		page_header_update_slot(vcpu->kvm, sptep, gfn);
+		if (!was_rmapped) {
+			rmap_count = rmap_add(vcpu, sptep, gfn);
+			if (rmap_count > RMAP_RECYCLE_THRESHOLD)
+				rmap_recycle(vcpu, sptep, gfn);
+		}
 	}
 	kvm_release_pfn_clean(pfn);
 	if (speculative) {
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
  2011-06-22 14:28 ` [PATCH v2 01/22] KVM: MMU: fix walking shadow page table Xiao Guangrong
  2011-06-22 14:28 ` [PATCH v2 02/22] KVM: MMU: do not update slot bitmap if spte is nonpresent Xiao Guangrong
@ 2011-06-22 14:29 ` Xiao Guangrong
  2011-06-29  8:21   ` Avi Kivity
  2011-06-22 14:29 ` [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code Xiao Guangrong
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

If the range spans a page boundary, the mmio access can be broken; fix it in
the same way as write emulation, by splitting the access at the boundary.

And since we have already obtained the guest physical address, use it to read
the guest data directly and avoid walking the guest page table again
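
To illustrate the split, a minimal user-space sketch of the same arithmetic
(4K pages and the stub handle_onepage() are assumptions made only for this
example):

#include <stdio.h>

/* 4K pages assumed for this example */
#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* stub standing in for the real one-page handler */
static void handle_onepage(unsigned long addr, unsigned int bytes)
{
        printf("access at 0x%lx, %u bytes\n", addr, bytes);
}

static void handle_access(unsigned long addr, unsigned int bytes)
{
        /* Crossing a page boundary? */
        if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
                /* bytes left until the end of the first page */
                unsigned int now = -addr & ~PAGE_MASK;

                handle_onepage(addr, now);
                addr += now;
                bytes -= now;
        }
        handle_onepage(addr, bytes);
}

int main(void)
{
        /* an 8-byte access starting 4 bytes before a page boundary
         * is split into 4 + 4 */
        handle_access(0x1ffc, 8);
        return 0;
}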

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/x86.c |   41 ++++++++++++++++++++++++++++++++---------
 1 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b803f0..eb27be4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3944,14 +3944,13 @@ out:
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
-static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
-				  unsigned long addr,
-				  void *val,
-				  unsigned int bytes,
-				  struct x86_exception *exception)
+static int emulator_read_emulated_onepage(unsigned long addr,
+					  void *val,
+					  unsigned int bytes,
+					  struct x86_exception *exception,
+					  struct kvm_vcpu *vcpu)
 {
-	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-	gpa_t                 gpa;
+	gpa_t gpa;
 	int handled;
 
 	if (vcpu->mmio_read_completed) {
@@ -3971,8 +3970,7 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
 	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
 		goto mmio;
 
-	if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
-	    == X86EMUL_CONTINUE)
+	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
 		return X86EMUL_CONTINUE;
 
 mmio:
@@ -4001,6 +3999,31 @@ mmio:
 	return X86EMUL_IO_NEEDED;
 }
 
+static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
+				  unsigned long addr,
+				  void *val,
+				  unsigned int bytes,
+				  struct x86_exception *exception)
+{
+	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+	/* Crossing a page boundary? */
+	if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
+		int rc, now;
+
+		now = -addr & ~PAGE_MASK;
+		rc = emulator_read_emulated_onepage(addr, val, now, exception,
+							vcpu);
+		if (rc != X86EMUL_CONTINUE)
+			return rc;
+		addr += now;
+		val += now;
+		bytes -= now;
+	}
+	return emulator_read_emulated_onepage(addr, val, bytes, exception,
+						vcpu);
+}
+
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			const void *val, int bytes)
 {
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (2 preceding siblings ...)
  2011-06-22 14:29 ` [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary Xiao Guangrong
@ 2011-06-22 14:29 ` Xiao Guangrong
  2011-06-29  8:24   ` Avi Kivity
  2011-06-22 14:30 ` [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation Xiao Guangrong
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Introduce vcpu_gva_to_gpa to translate a gva to a gpa; we can use it to clean
up the duplicated code shared by read emulation and write emulation
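
The convention the callers rely on is: a negative return means the translation
faulted, a positive return means the gpa must be handled as mmio, and zero
means ordinary guest memory. A stand-alone sketch of a caller dispatching on
it (translate() and its address ranges are made up for the example):

#include <stdio.h>

/* hypothetical translation result, mirroring the convention below:
 * -1: translation faulted, 1: mmio access, 0: plain guest memory */
static int translate(unsigned long gva, unsigned long *gpa)
{
        *gpa = gva;             /* identity mapping, for the sketch only */
        if (gva == 0)
                return -1;      /* unmapped */
        if (gva >= 0xfee00000UL && gva < 0xfee01000UL)
                return 1;       /* APIC page -> mmio */
        return 0;
}

static void access_gva(unsigned long gva)
{
        unsigned long gpa;
        int ret = translate(gva, &gpa);

        if (ret < 0)
                printf("0x%lx: propagate fault\n", gva);
        else if (ret)
                printf("0x%lx: handle as mmio (gpa 0x%lx)\n", gva, gpa);
        else
                printf("0x%lx: read/write guest memory (gpa 0x%lx)\n", gva, gpa);
}

int main(void)
{
        access_gva(0);
        access_gva(0xfee00030UL);
        access_gva(0x1000UL);
        return 0;
}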

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/x86.c |   38 +++++++++++++++++++++++++++++---------
 1 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb27be4..c29ef96 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3944,6 +3944,27 @@ out:
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
+static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
+			   gpa_t *gpa, struct x86_exception *exception,
+			   bool write)
+{
+	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+
+	if (write)
+		access |= PFERR_WRITE_MASK;
+
+	*gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
+
+	if (*gpa == UNMAPPED_GVA)
+		return -1;
+
+	/* For APIC access vmexit */
+	if ((*gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+		return 1;
+
+	return 0;
+}
+
 static int emulator_read_emulated_onepage(unsigned long addr,
 					  void *val,
 					  unsigned int bytes,
@@ -3951,7 +3972,7 @@ static int emulator_read_emulated_onepage(unsigned long addr,
 					  struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
-	int handled;
+	int handled, ret;
 
 	if (vcpu->mmio_read_completed) {
 		memcpy(val, vcpu->mmio_data, bytes);
@@ -3961,13 +3982,12 @@ static int emulator_read_emulated_onepage(unsigned long addr,
 		return X86EMUL_CONTINUE;
 	}
 
-	gpa = kvm_mmu_gva_to_gpa_read(vcpu, addr, exception);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, false);
 
-	if (gpa == UNMAPPED_GVA)
+	if (ret < 0)
 		return X86EMUL_PROPAGATE_FAULT;
 
-	/* For APIC access vmexit */
-	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+	if (ret)
 		goto mmio;
 
 	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
@@ -4043,15 +4063,15 @@ static int emulator_write_emulated_onepage(unsigned long addr,
 					   struct kvm_vcpu *vcpu)
 {
 	gpa_t                 gpa;
-	int handled;
+	int handled, ret;
 
-	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, exception);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, true);
 
-	if (gpa == UNMAPPED_GVA)
+	if (ret < 0)
 		return X86EMUL_PROPAGATE_FAULT;
 
 	/* For APIC access vmexit */
-	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+	if (ret)
 		goto mmio;
 
 	if (emulator_write_phys(vcpu, gpa, val, bytes))
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (3 preceding siblings ...)
  2011-06-22 14:29 ` [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code Xiao Guangrong
@ 2011-06-22 14:30 ` Xiao Guangrong
  2011-06-29  8:37   ` Avi Kivity
  2011-06-22 14:30 ` [PATCH v2 06/22] KVM: x86: cleanup the code of " Xiao Guangrong
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

The operations of read emulation and write emulation are very similar, so we
can abstract them behind a common set of operations; a later patch uses it to
remove the duplicated code
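
A stand-alone sketch of the idea, with per-direction callbacks behind one
common path (the struct and function names are simplified stand-ins for the
ops introduced below):

#include <stdio.h>
#include <stdbool.h>

/* per-direction callbacks, mirroring the ops structure introduced below */
struct rw_ops {
        int (*emulate)(void *val, int bytes);   /* fast path: guest memory */
        int (*mmio)(void *val, int bytes);      /* fall back to mmio handlers */
        bool write;
};

static int read_emulate(void *val, int bytes)
{
        printf("read %d bytes from guest memory\n", bytes);
        return 1;
}

static int write_emulate(void *val, int bytes)
{
        printf("write %d bytes to guest memory\n", bytes);
        return 1;
}

static int read_mmio(void *val, int bytes)
{
        return bytes;
}

static int write_mmio(void *val, int bytes)
{
        return bytes;
}

static struct rw_ops read_ops = {
        .emulate = read_emulate,
        .mmio = read_mmio,
};

static struct rw_ops write_ops = {
        .emulate = write_emulate,
        .mmio = write_mmio,
        .write = true,
};

/* one common path serves both directions */
static int emulate_access(void *val, int bytes, struct rw_ops *ops)
{
        if (ops->emulate(val, bytes))
                return 0;
        return ops->mmio(val, bytes);
}

int main(void)
{
        char buf[8];

        emulate_access(buf, sizeof(buf), &read_ops);
        emulate_access(buf, sizeof(buf), &write_ops);
        return 0;
}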

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/x86.c |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c29ef96..887714f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4056,6 +4056,78 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 	return 1;
 }
 
+struct read_write_emulator_ops {
+	int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
+				  int bytes);
+	int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
+				  void *val, int bytes);
+	int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       int bytes, void *val);
+	int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
+				    void *val, int bytes);
+	bool write;
+};
+
+static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
+{
+	if (vcpu->mmio_read_completed) {
+		memcpy(val, vcpu->mmio_data, bytes);
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
+			       vcpu->mmio_phys_addr, *(u64 *)val);
+		vcpu->mmio_read_completed = 0;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
+			void *val, int bytes)
+{
+	return !kvm_read_guest(vcpu->kvm, gpa, val, bytes);
+}
+
+static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
+			 void *val, int bytes)
+{
+	return emulator_write_phys(vcpu, gpa, val, bytes);
+}
+
+static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
+{
+	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
+	return vcpu_mmio_write(vcpu, gpa, bytes, val);
+}
+
+static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
+			  void *val, int bytes)
+{
+	trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, 0);
+	return X86EMUL_IO_NEEDED;
+}
+
+static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
+			   void *val, int bytes)
+{
+	memcpy(vcpu->mmio_data, val, bytes);
+	memcpy(vcpu->run->mmio.data, vcpu->mmio_data, 8);
+	return X86EMUL_CONTINUE;
+}
+
+static struct read_write_emulator_ops read_emultor = {
+	.read_write_prepare = read_prepare,
+	.read_write_emulate = read_emulate,
+	.read_write_mmio = vcpu_mmio_read,
+	.read_write_exit_mmio = read_exit_mmio,
+};
+
+static struct read_write_emulator_ops write_emultor = {
+	.read_write_emulate = write_emulate,
+	.read_write_mmio = write_mmio,
+	.read_write_exit_mmio = write_exit_mmio,
+	.write = true,
+};
+
 static int emulator_write_emulated_onepage(unsigned long addr,
 					   const void *val,
 					   unsigned int bytes,
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 06/22] KVM: x86: cleanup the code of read/write emulation
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (4 preceding siblings ...)
  2011-06-22 14:30 ` [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation Xiao Guangrong
@ 2011-06-22 14:30 ` Xiao Guangrong
  2011-06-22 14:31 ` [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path Xiao Guangrong
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Use the read/write emulator operations to remove the duplicated code

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/x86.c |  149 ++++++++++++++++-----------------------------------
 1 files changed, 47 insertions(+), 102 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 887714f..baa5a11 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3965,85 +3965,6 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 	return 0;
 }
 
-static int emulator_read_emulated_onepage(unsigned long addr,
-					  void *val,
-					  unsigned int bytes,
-					  struct x86_exception *exception,
-					  struct kvm_vcpu *vcpu)
-{
-	gpa_t gpa;
-	int handled, ret;
-
-	if (vcpu->mmio_read_completed) {
-		memcpy(val, vcpu->mmio_data, bytes);
-		trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
-			       vcpu->mmio_phys_addr, *(u64 *)val);
-		vcpu->mmio_read_completed = 0;
-		return X86EMUL_CONTINUE;
-	}
-
-	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, false);
-
-	if (ret < 0)
-		return X86EMUL_PROPAGATE_FAULT;
-
-	if (ret)
-		goto mmio;
-
-	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
-		return X86EMUL_CONTINUE;
-
-mmio:
-	/*
-	 * Is this MMIO handled locally?
-	 */
-	handled = vcpu_mmio_read(vcpu, gpa, bytes, val);
-
-	if (handled == bytes)
-		return X86EMUL_CONTINUE;
-
-	gpa += handled;
-	bytes -= handled;
-	val += handled;
-
-	trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, 0);
-
-	vcpu->mmio_needed = 1;
-	vcpu->run->exit_reason = KVM_EXIT_MMIO;
-	vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa;
-	vcpu->mmio_size = bytes;
-	vcpu->run->mmio.len = min(vcpu->mmio_size, 8);
-	vcpu->run->mmio.is_write = vcpu->mmio_is_write = 0;
-	vcpu->mmio_index = 0;
-
-	return X86EMUL_IO_NEEDED;
-}
-
-static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
-				  unsigned long addr,
-				  void *val,
-				  unsigned int bytes,
-				  struct x86_exception *exception)
-{
-	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-
-	/* Crossing a page boundary? */
-	if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
-		int rc, now;
-
-		now = -addr & ~PAGE_MASK;
-		rc = emulator_read_emulated_onepage(addr, val, now, exception,
-							vcpu);
-		if (rc != X86EMUL_CONTINUE)
-			return rc;
-		addr += now;
-		val += now;
-		bytes -= now;
-	}
-	return emulator_read_emulated_onepage(addr, val, bytes, exception,
-						vcpu);
-}
-
 int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			const void *val, int bytes)
 {
@@ -4128,16 +4049,22 @@ static struct read_write_emulator_ops write_emultor = {
 	.write = true,
 };
 
-static int emulator_write_emulated_onepage(unsigned long addr,
-					   const void *val,
-					   unsigned int bytes,
-					   struct x86_exception *exception,
-					   struct kvm_vcpu *vcpu)
+static int emulator_read_write_onepage(unsigned long addr, void *val,
+				       unsigned int bytes,
+				       struct x86_exception *exception,
+				       struct kvm_vcpu *vcpu,
+				       struct read_write_emulator_ops *ops)
+
 {
-	gpa_t                 gpa;
+	gpa_t gpa;
 	int handled, ret;
+	bool write = ops->write;
+
+	if (ops->read_write_prepare &&
+	      ops->read_write_prepare(vcpu, val, bytes))
+		return X86EMUL_CONTINUE;
 
-	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, true);
+	ret = vcpu_gva_to_gpa(vcpu, addr, &gpa, exception, write);
 
 	if (ret < 0)
 		return X86EMUL_PROPAGATE_FAULT;
@@ -4146,15 +4073,14 @@ static int emulator_write_emulated_onepage(unsigned long addr,
 	if (ret)
 		goto mmio;
 
-	if (emulator_write_phys(vcpu, gpa, val, bytes))
+	if (ops->read_write_emulate(vcpu, gpa, val, bytes))
 		return X86EMUL_CONTINUE;
 
 mmio:
-	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
 	/*
 	 * Is this MMIO handled locally?
 	 */
-	handled = vcpu_mmio_write(vcpu, gpa, bytes, val);
+	handled = ops->read_write_mmio(vcpu, gpa, bytes, val);
 	if (handled == bytes)
 		return X86EMUL_CONTINUE;
 
@@ -4163,23 +4089,20 @@ mmio:
 	val += handled;
 
 	vcpu->mmio_needed = 1;
-	memcpy(vcpu->mmio_data, val, bytes);
 	vcpu->run->exit_reason = KVM_EXIT_MMIO;
 	vcpu->run->mmio.phys_addr = vcpu->mmio_phys_addr = gpa;
 	vcpu->mmio_size = bytes;
 	vcpu->run->mmio.len = min(vcpu->mmio_size, 8);
-	vcpu->run->mmio.is_write = vcpu->mmio_is_write = 1;
-	memcpy(vcpu->run->mmio.data, vcpu->mmio_data, 8);
+	vcpu->run->mmio.is_write = vcpu->mmio_is_write = write;
 	vcpu->mmio_index = 0;
 
-	return X86EMUL_CONTINUE;
+	return ops->read_write_exit_mmio(vcpu, gpa, val, bytes);
 }
 
-int emulator_write_emulated(struct x86_emulate_ctxt *ctxt,
-			    unsigned long addr,
-			    const void *val,
-			    unsigned int bytes,
-			    struct x86_exception *exception)
+int emulator_read_write(struct x86_emulate_ctxt *ctxt, unsigned long addr,
+			void *val, unsigned int bytes,
+			struct x86_exception *exception,
+			struct read_write_emulator_ops *ops)
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
 
@@ -4188,16 +4111,38 @@ int emulator_write_emulated(struct x86_emulate_ctxt *ctxt,
 		int rc, now;
 
 		now = -addr & ~PAGE_MASK;
-		rc = emulator_write_emulated_onepage(addr, val, now, exception,
-						     vcpu);
+		rc = emulator_read_write_onepage(addr, val, now, exception,
+						 vcpu, ops);
+
 		if (rc != X86EMUL_CONTINUE)
 			return rc;
 		addr += now;
 		val += now;
 		bytes -= now;
 	}
-	return emulator_write_emulated_onepage(addr, val, bytes, exception,
-					       vcpu);
+
+	return emulator_read_write_onepage(addr, val, bytes, exception,
+					   vcpu, ops);
+}
+
+static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
+				  unsigned long addr,
+				  void *val,
+				  unsigned int bytes,
+				  struct x86_exception *exception)
+{
+	return emulator_read_write(ctxt, addr, val, bytes,
+				   exception, &read_emultor);
+}
+
+int emulator_write_emulated(struct x86_emulate_ctxt *ctxt,
+			    unsigned long addr,
+			    const void *val,
+			    unsigned int bytes,
+			    struct x86_exception *exception)
+{
+	return emulator_read_write(ctxt, addr, (void *)val, bytes,
+				   exception, &write_emultor);
 }
 
 #define CMPXCHG_TYPE(t, ptr, old, new) \
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (5 preceding siblings ...)
  2011-06-22 14:30 ` [PATCH v2 06/22] KVM: x86: cleanup the code of " Xiao Guangrong
@ 2011-06-22 14:31 ` Xiao Guangrong
  2011-06-29  8:48   ` Avi Kivity
  2011-06-22 14:31 ` [PATCH v2 08/22] KVM: MMU: optimize to handle dirty bit Xiao Guangrong
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

If the page fault is caused by mmio, we can cache the mmio info; later, while
we emulate the mmio instruction, we can quickly recognize it as an mmio fault
without walking the guest page table again
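
A user-space sketch of the one-entry cache (the field names mimic the ones
added below, everything else is simplified and made up for the example):

#include <stdio.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* one-entry cache filled on the page-fault path */
static unsigned long mmio_gva;  /* page-aligned gva, 0 means "empty" */
static unsigned long mmio_gfn;

static void cache_mmio_info(unsigned long gva, unsigned long gfn)
{
        mmio_gva = gva & PAGE_MASK;
        mmio_gfn = gfn;
}

/* returns true and fills *gpa if the gva hits the cached mmio page */
static bool match_mmio_gva(unsigned long gva, unsigned long *gpa)
{
        if (!mmio_gva || mmio_gva != (gva & PAGE_MASK))
                return false;

        *gpa = (mmio_gfn << PAGE_SHIFT) | (gva & (PAGE_SIZE - 1));
        return true;
}

int main(void)
{
        unsigned long gpa;

        cache_mmio_info(0xc0001234UL, 0xfee00UL);       /* fault path caches it */

        if (match_mmio_gva(0xc0001238UL, &gpa))         /* later emulation hits */
                printf("mmio hit, gpa = 0x%lx\n", gpa);
        return 0;
}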

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    5 +++++
 arch/x86/kvm/mmu.c              |   21 +++++++--------------
 arch/x86/kvm/mmu.h              |   23 +++++++++++++++++++++++
 arch/x86/kvm/paging_tmpl.h      |   21 ++++++++++++++-------
 arch/x86/kvm/x86.c              |   11 +++++++++++
 arch/x86/kvm/x86.h              |   36 ++++++++++++++++++++++++++++++++++++
 6 files changed, 96 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..7b0834a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -415,6 +415,11 @@ struct kvm_vcpu_arch {
 	u64 mcg_ctl;
 	u64 *mce_banks;
 
+	/* Cache MMIO info */
+	u64 mmio_gva;
+	unsigned access;
+	gfn_t mmio_gfn;
+
 	/* used for guest single stepping over the given code position */
 	unsigned long singlestep_rip;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8316c2d..7f53210 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -217,11 +217,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
 
-static bool is_write_protection(struct kvm_vcpu *vcpu)
-{
-	return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
-}
-
 static int is_cpuid_PSE36(void)
 {
 	return 1;
@@ -243,11 +238,6 @@ static int is_large_pte(u64 pte)
 	return pte & PT_PAGE_SIZE_MASK;
 }
 
-static int is_writable_pte(unsigned long pte)
-{
-	return pte & PT_WRITABLE_MASK;
-}
-
 static int is_dirty_gpte(unsigned long pte)
 {
 	return pte & PT_DIRTY_MASK;
@@ -2247,15 +2237,17 @@ static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *
 	send_sig_info(SIGBUS, &info, tsk);
 }
 
-static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
+static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gva_t gva,
+			       unsigned access, gfn_t gfn, pfn_t pfn)
 {
 	kvm_release_pfn_clean(pfn);
 	if (is_hwpoison_pfn(pfn)) {
-		kvm_send_hwpoison_signal(gfn_to_hva(kvm, gfn), current);
+		kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
 		return 0;
 	} else if (is_fault_pfn(pfn))
 		return -EFAULT;
 
+	vcpu_cache_mmio_info(vcpu, gva, gfn, access);
 	return 1;
 }
 
@@ -2337,7 +2329,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
 
 	/* mmio */
 	if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
+		return kvm_handle_bad_page(vcpu, v, ACC_ALL, gfn, pfn);
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -2564,6 +2556,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
+	vcpu_clear_mmio_info(vcpu, ~0ul);
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
@@ -2710,7 +2703,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 
 	/* mmio */
 	if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
+		return kvm_handle_bad_page(vcpu, 0, 0, gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7086ca8..05310b1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -76,4 +76,27 @@ static inline int is_present_gpte(unsigned long pte)
 	return pte & PT_PRESENT_MASK;
 }
 
+static inline int is_writable_pte(unsigned long pte)
+{
+	return pte & PT_WRITABLE_MASK;
+}
+
+static inline bool is_write_protection(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
+}
+
+static inline bool check_write_user_access(struct kvm_vcpu *vcpu,
+					   bool write_fault, bool user_fault,
+					   unsigned long pte)
+{
+	if (unlikely(write_fault && !is_writable_pte(pte)
+	      && (user_fault || is_write_protection(vcpu))))
+		return false;
+
+	if (unlikely(user_fault && !(pte & PT_USER_MASK)))
+		return false;
+
+	return true;
+}
 #endif
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1caeb4d..13978dc 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -201,11 +201,8 @@ walk:
 			break;
 		}
 
-		if (unlikely(write_fault && !is_writable_pte(pte)
-			     && (user_fault || is_write_protection(vcpu))))
-			eperm = true;
-
-		if (unlikely(user_fault && !(pte & PT_USER_MASK)))
+		if (!check_write_user_access(vcpu, write_fault, user_fault,
+					  pte))
 			eperm = true;
 
 #if PTTYPE == 64
@@ -631,8 +628,16 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 		return 0;
 
 	/* mmio */
-	if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu->kvm, walker.gfn, pfn);
+	if (is_error_pfn(pfn)) {
+		unsigned access = walker.pte_access;
+		bool dirty = is_dirty_gpte(walker.ptes[walker.level - 1]);
+
+		if (dirty)
+			access &= ~ACC_WRITE_MASK;
+
+		return kvm_handle_bad_page(vcpu, mmu_is_nested(vcpu) ? 0 :
+					   addr, access, walker.gfn, pfn);
+	}
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -672,6 +677,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 	u64 *sptep;
 	int need_flush = 0;
 
+	vcpu_clear_mmio_info(vcpu, gva);
+
 	spin_lock(&vcpu->kvm->mmu_lock);
 
 	for_each_shadow_entry(vcpu, gva, iterator) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index baa5a11..40ffbc5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3950,6 +3950,14 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 {
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 
+	if (vcpu_match_mmio_gva(vcpu, gva) &&
+		  check_write_user_access(vcpu, write, access,
+		  vcpu->arch.access)) {
+		*gpa = vcpu->arch.mmio_gfn << PAGE_SHIFT |
+					(gva & (PAGE_SIZE - 1));
+		return 1;
+	}
+
 	if (write)
 		access |= PFERR_WRITE_MASK;
 
@@ -3962,6 +3970,9 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 	if ((*gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
 		return 1;
 
+	if (vcpu_match_mmio_gpa(vcpu, *gpa))
+		return 1;
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 256da82..d36fe23 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -75,6 +75,42 @@ static inline u32 bit(int bitno)
 	return 1 << (bitno & 31);
 }
 
+static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
+					gva_t gva, gfn_t gfn, unsigned access)
+{
+	vcpu->arch.mmio_gva = gva & PAGE_MASK;
+	vcpu->arch.access = access;
+	vcpu->arch.mmio_gfn = gfn;
+}
+
+/*
+ * Clear the mmio cache info for the given gva,
+ * specially, if gva is ~0ul, we clear all mmio cache info.
+ */
+static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
+{
+	if (gva != (~0ul) && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
+		return;
+
+	vcpu->arch.mmio_gva = 0;
+}
+
+static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long gva)
+{
+	if (vcpu->arch.mmio_gva && vcpu->arch.mmio_gva == (gva & PAGE_MASK))
+		return true;
+
+	return false;
+}
+
+static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	if (vcpu->arch.mmio_gfn && vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
+		return true;
+
+	return false;
+}
+
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 08/22] KVM: MMU: optimize to handle dirty bit
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (6 preceding siblings ...)
  2011-06-22 14:31 ` [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path Xiao Guangrong
@ 2011-06-22 14:31 ` Xiao Guangrong
  2011-06-22 14:31 ` [PATCH v2 09/22] KVM: MMU: cleanup for FNAME(fetch) Xiao Guangrong
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

If the dirty bit is not set, we can make the pte access read-only to avoid
handling the dirty bit everywhere
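
Roughly, the last-level access computation becomes the following (a simplified
stand-alone sketch; only the pte bit positions are the real x86 ones):

#include <stdio.h>
#include <stdbool.h>

#define PT_WRITABLE_MASK (1UL << 1)
#define PT_USER_MASK     (1UL << 2)
#define PT_DIRTY_MASK    (1UL << 6)

#define ACC_WRITE_MASK PT_WRITABLE_MASK

/* sketch of gpte_access(): for the last level, a clean gpte is mapped
 * read-only so the first write refaults and sets the dirty bit */
static unsigned gpte_access(unsigned long gpte, bool last)
{
        unsigned access = gpte & (PT_WRITABLE_MASK | PT_USER_MASK);

        if (last && !(gpte & PT_DIRTY_MASK))
                access &= ~ACC_WRITE_MASK;

        return access;
}

int main(void)
{
        unsigned long clean = PT_WRITABLE_MASK | PT_USER_MASK;
        unsigned long dirty = clean | PT_DIRTY_MASK;

        printf("clean gpte -> writable? %d\n",
               !!(gpte_access(clean, true) & ACC_WRITE_MASK));
        printf("dirty gpte -> writable? %d\n",
               !!(gpte_access(dirty, true) & ACC_WRITE_MASK));
        return 0;
}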

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |   13 +++++------
 arch/x86/kvm/paging_tmpl.h |   46 ++++++++++++++++++-------------------------
 2 files changed, 25 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7f53210..9e35577 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1923,7 +1923,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int user_fault,
-		    int write_fault, int dirty, int level,
+		    int write_fault, int level,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
 		    bool can_unsync, bool host_writable)
 {
@@ -1938,8 +1938,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	spte = PT_PRESENT_MASK;
 	if (!speculative)
 		spte |= shadow_accessed_mask;
-	if (!dirty)
-		pte_access &= ~ACC_WRITE_MASK;
+
 	if (pte_access & ACC_EXEC_MASK)
 		spte |= shadow_x_mask;
 	else
@@ -2023,7 +2022,7 @@ done:
 
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 unsigned pt_access, unsigned pte_access,
-			 int user_fault, int write_fault, int dirty,
+			 int user_fault, int write_fault,
 			 int *ptwrite, int level, gfn_t gfn,
 			 pfn_t pfn, bool speculative,
 			 bool host_writable)
@@ -2059,7 +2058,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	}
 
 	if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
-		      dirty, level, gfn, pfn, speculative, true,
+		      level, gfn, pfn, speculative, true,
 		      host_writable)) {
 		if (write_fault)
 			*ptwrite = 1;
@@ -2129,7 +2128,7 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 
 	for (i = 0; i < ret; i++, gfn++, start++)
 		mmu_set_spte(vcpu, start, ACC_ALL,
-			     access, 0, 0, 1, NULL,
+			     access, 0, 0, NULL,
 			     sp->role.level, gfn,
 			     page_to_pfn(pages[i]), true, true);
 
@@ -2193,7 +2192,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			unsigned pte_access = ACC_ALL;
 
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
-				     0, write, 1, &pt_write,
+				     0, write, &pt_write,
 				     level, gfn, pfn, prefault, map_writable);
 			direct_pte_prefetch(vcpu, iterator.sptep);
 			++vcpu->stat.pf_fixed;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 13978dc..e06c9e4 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -101,11 +101,15 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return (ret != orig_pte);
 }
 
-static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
+static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
+				   bool last)
 {
 	unsigned access;
 
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+	if (last && !is_dirty_gpte(gpte))
+		access &= ~ACC_WRITE_MASK;
+
 #if PTTYPE == 64
 	if (vcpu->arch.mmu.nx)
 		access &= ~(gpte >> PT64_NX_SHIFT);
@@ -227,8 +231,6 @@ walk:
 			pte |= PT_ACCESSED_MASK;
 		}
 
-		pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
-
 		walker->ptes[walker->level - 1] = pte;
 
 		if ((walker->level == PT_PAGE_TABLE_LEVEL) ||
@@ -269,7 +271,7 @@ walk:
 			break;
 		}
 
-		pt_access = pte_access;
+		pt_access &= FNAME(gpte_access)(vcpu, pte, false);
 		--walker->level;
 	}
 
@@ -293,6 +295,7 @@ walk:
 		walker->ptes[walker->level - 1] = pte;
 	}
 
+	pte_access = pt_access & FNAME(gpte_access)(vcpu, pte, true);
 	walker->pt_access = pt_access;
 	walker->pte_access = pte_access;
 	pgprintk("%s: pte %llx pte_access %x pt_access %x\n",
@@ -373,7 +376,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		return;
 
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
-	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte, true);
 	pfn = gfn_to_pfn_atomic(vcpu->kvm, gpte_to_gfn(gpte));
 	if (is_error_pfn(pfn)) {
 		kvm_release_pfn_clean(pfn);
@@ -385,7 +388,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 * vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
 	 */
 	mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
-		     is_dirty_gpte(gpte), NULL, PT_PAGE_TABLE_LEVEL,
+		     NULL, PT_PAGE_TABLE_LEVEL,
 		     gpte_to_gfn(gpte), pfn, true, true);
 }
 
@@ -436,7 +439,6 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 		unsigned pte_access;
 		gfn_t gfn;
 		pfn_t pfn;
-		bool dirty;
 
 		if (spte == sptep)
 			continue;
@@ -449,18 +451,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 			continue;
 
-		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte,
+								  true);
 		gfn = gpte_to_gfn(gpte);
-		dirty = is_dirty_gpte(gpte);
 		pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
-				      (pte_access & ACC_WRITE_MASK) && dirty);
+				      pte_access & ACC_WRITE_MASK);
 		if (is_error_pfn(pfn)) {
 			kvm_release_pfn_clean(pfn);
 			break;
 		}
 
 		mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
-			     dirty, NULL, PT_PAGE_TABLE_LEVEL, gfn,
+			     NULL, PT_PAGE_TABLE_LEVEL, gfn,
 			     pfn, true, true);
 	}
 }
@@ -476,7 +478,6 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 {
 	unsigned access = gw->pt_access;
 	struct kvm_mmu_page *sp = NULL;
-	bool dirty = is_dirty_gpte(gw->ptes[gw->level - 1]);
 	int top_level;
 	unsigned direct_access;
 	struct kvm_shadow_walk_iterator it;
@@ -485,8 +486,6 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 		return NULL;
 
 	direct_access = gw->pt_access & gw->pte_access;
-	if (!dirty)
-		direct_access &= ~ACC_WRITE_MASK;
 
 	top_level = vcpu->arch.mmu.root_level;
 	if (top_level == PT32E_ROOT_LEVEL)
@@ -545,7 +544,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	}
 
 	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access,
-		     user_fault, write_fault, dirty, ptwrite, it.level,
+		     user_fault, write_fault, ptwrite, it.level,
 		     gw->gfn, pfn, prefault, map_writable);
 	FNAME(pte_prefetch)(vcpu, gw, it.sptep);
 
@@ -628,17 +627,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 		return 0;
 
 	/* mmio */
-	if (is_error_pfn(pfn)) {
-		unsigned access = walker.pte_access;
-		bool dirty = is_dirty_gpte(walker.ptes[walker.level - 1]);
-
-		if (dirty)
-			access &= ~ACC_WRITE_MASK;
-
+	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu, mmu_is_nested(vcpu) ? 0 :
-					   addr, access, walker.gfn, pfn);
-	}
-
+				      addr, walker.pte_access, walker.gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
@@ -855,11 +846,12 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		}
 
 		nr_present++;
-		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte,
+								  true);
 		host_writable = sp->spt[i] & SPTE_HOST_WRITEABLE;
 
 		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
-			 is_dirty_gpte(gpte), PT_PAGE_TABLE_LEVEL, gfn,
+			 PT_PAGE_TABLE_LEVEL, gfn,
 			 spte_to_pfn(sp->spt[i]), true, false,
 			 host_writable);
 	}
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 09/22] KVM: MMU: cleanup for FNAME(fetch)
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (7 preceding siblings ...)
  2011-06-22 14:31 ` [PATCH v2 08/22] KVM: MMU: optimize to handle dirty bit Xiao Guangrong
@ 2011-06-22 14:31 ` Xiao Guangrong
  2011-06-22 14:32 ` [PATCH v2 10/22] KVM: MMU: rename 'pt_write' to 'emulate' Xiao Guangrong
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

gw->pte_access is the final access permission, since it has already been
combined with gw->pt_access when we walked the guest page table:

FNAME(walk_addr_generic):
	pte_access = pt_access & FNAME(gpte_access)(vcpu, pte, true);

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/paging_tmpl.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e06c9e4..e9ee473 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -485,7 +485,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	if (!is_present_gpte(gw->ptes[gw->level - 1]))
 		return NULL;
 
-	direct_access = gw->pt_access & gw->pte_access;
+	direct_access = gw->pte_access;
 
 	top_level = vcpu->arch.mmu.root_level;
 	if (top_level == PT32E_ROOT_LEVEL)
@@ -543,7 +543,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 		link_shadow_page(it.sptep, sp);
 	}
 
-	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access,
+	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
 		     user_fault, write_fault, ptwrite, it.level,
 		     gw->gfn, pfn, prefault, map_writable);
 	FNAME(pte_prefetch)(vcpu, gw, it.sptep);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v2 10/22] KVM: MMU: rename 'pt_write' to 'emulate'
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (8 preceding siblings ...)
  2011-06-22 14:31 ` [PATCH v2 09/22] KVM: MMU: cleanup for FNAME(fetch) Xiao Guangrong
@ 2011-06-22 14:32 ` Xiao Guangrong
  2011-06-22 14:32 ` [PATCH v2 11/22] KVM: MMU: count used shadow pages on prepareing path Xiao Guangrong
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

If 'pt_write' is true, we need to emulate the fault. In a later patch, we will
need to emulate the fault even when it is not a pt_write event, so rename it
to better reflect its meaning

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |   10 +++++-----
 arch/x86/kvm/paging_tmpl.h |   16 ++++++++--------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9e35577..580ecf0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2023,7 +2023,7 @@ done:
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 unsigned pt_access, unsigned pte_access,
 			 int user_fault, int write_fault,
-			 int *ptwrite, int level, gfn_t gfn,
+			 int *emulate, int level, gfn_t gfn,
 			 pfn_t pfn, bool speculative,
 			 bool host_writable)
 {
@@ -2061,7 +2061,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		      level, gfn, pfn, speculative, true,
 		      host_writable)) {
 		if (write_fault)
-			*ptwrite = 1;
+			*emulate = 1;
 		kvm_mmu_flush_tlb(vcpu);
 	}
 
@@ -2184,7 +2184,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
-	int pt_write = 0;
+	int emulate = 0;
 	gfn_t pseudo_gfn;
 
 	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
@@ -2192,7 +2192,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			unsigned pte_access = ACC_ALL;
 
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
-				     0, write, &pt_write,
+				     0, write, &emulate,
 				     level, gfn, pfn, prefault, map_writable);
 			direct_pte_prefetch(vcpu, iterator.sptep);
 			++vcpu->stat.pf_fixed;
@@ -2220,7 +2220,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 				   | shadow_accessed_mask);
 		}
 	}
-	return pt_write;
+	return emulate;
 }
 
 static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *tsk)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e9ee473..1e534e0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -473,7 +473,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			 struct guest_walker *gw,
 			 int user_fault, int write_fault, int hlevel,
-			 int *ptwrite, pfn_t pfn, bool map_writable,
+			 int *emulate, pfn_t pfn, bool map_writable,
 			 bool prefault)
 {
 	unsigned access = gw->pt_access;
@@ -544,7 +544,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	}
 
 	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
-		     user_fault, write_fault, ptwrite, it.level,
+		     user_fault, write_fault, emulate, it.level,
 		     gw->gfn, pfn, prefault, map_writable);
 	FNAME(pte_prefetch)(vcpu, gw, it.sptep);
 
@@ -578,7 +578,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	int user_fault = error_code & PFERR_USER_MASK;
 	struct guest_walker walker;
 	u64 *sptep;
-	int write_pt = 0;
+	int emulate = 0;
 	int r;
 	pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
@@ -639,19 +639,19 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (!force_pt_level)
 		transparent_hugepage_adjust(vcpu, &walker.gfn, &pfn, &level);
 	sptep = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
-			     level, &write_pt, pfn, map_writable, prefault);
+			     level, &emulate, pfn, map_writable, prefault);
 	(void)sptep;
-	pgprintk("%s: shadow pte %p %llx ptwrite %d\n", __func__,
-		 sptep, *sptep, write_pt);
+	pgprintk("%s: shadow pte %p %llx emulate %d\n", __func__,
+		 sptep, *sptep, emulate);
 
-	if (!write_pt)
+	if (!emulate)
 		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
 
 	++vcpu->stat.pf_fixed;
 	trace_kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
-	return write_pt;
+	return emulate;
 
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
-- 
1.7.5.4



* [PATCH v2 11/22] KVM: MMU: count used shadow pages on preparing path
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (9 preceding siblings ...)
  2011-06-22 14:32 ` [PATCH v2 10/22] KVM: MMU: rename 'pt_write' to 'emulate' Xiao Guangrong
@ 2011-06-22 14:32 ` Xiao Guangrong
  2011-06-22 14:32 ` [PATCH v2 12/22] KVM: MMU: split kvm_mmu_free_page Xiao Guangrong
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Move the accounting of used shadow pages from the committing path to the
preparing path, so that the commit step can be batched and the number of TLB
flushes is reduced on some paths.
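
Not part of the patch, just an illustration: a small user-space model of the
pattern the hunks below switch to, with prepare_zap()/commit_zap() standing in
for kvm_mmu_prepare_zap_page()/kvm_mmu_commit_zap_page(). The commit step is
where the remote TLB flush happens, so hoisting it out of the loop turns one
flush per zapped page into a single flush.

#include <stdio.h>

static int flushes;

static void prepare_zap(int page) { (void)page; /* unlink + account, cheap */ }
static void commit_zap(void)      { flushes++;  /* models the remote TLB flush */ }

int main(void)
{
	int i;

	flushes = 0;			/* old: commit (and flush) per page */
	for (i = 0; i < 8; i++) {
		prepare_zap(i);
		commit_zap();
	}
	printf("per-page commit: %d flushes\n", flushes);

	flushes = 0;			/* new: prepare in the loop, commit once */
	for (i = 0; i < 8; i++)
		prepare_zap(i);
	commit_zap();
	printf("batched commit:  %d flushes\n", flushes);
	return 0;
}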

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 580ecf0..8233aba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1039,7 +1039,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
@@ -1048,7 +1048,6 @@ static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (!sp->role.direct)
 		free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
-	kvm_mod_used_mmu_pages(kvm, -1);
 }
 
 static unsigned kvm_page_table_hashfn(gfn_t gfn)
@@ -1655,6 +1654,7 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 		/* Count self */
 		ret++;
 		list_move(&sp->link, invalid_list);
+		kvm_mod_used_mmu_pages(kvm, -1);
 	} else {
 		list_move(&sp->link, &kvm->arch.active_mmu_pages);
 		kvm_reload_remote_mmus(kvm);
@@ -1678,7 +1678,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	do {
 		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_free_page(kvm, sp);
+		kvm_mmu_free_page(sp);
 	} while (!list_empty(invalid_list));
 
 }
@@ -1704,8 +1704,8 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages)
 			page = container_of(kvm->arch.active_mmu_pages.prev,
 					    struct kvm_mmu_page, link);
 			kvm_mmu_prepare_zap_page(kvm, page, &invalid_list);
-			kvm_mmu_commit_zap_page(kvm, &invalid_list);
 		}
+		kvm_mmu_commit_zap_page(kvm, &invalid_list);
 		goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
 	}
 
@@ -3312,9 +3312,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
 		sp = container_of(vcpu->kvm->arch.active_mmu_pages.prev,
 				  struct kvm_mmu_page, link);
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
-		kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 		++vcpu->kvm->stat.mmu_recycled;
 	}
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 }
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
-- 
1.7.5.4



* [PATCH v2 12/22] KVM: MMU: split kvm_mmu_free_page
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (10 preceding siblings ...)
  2011-06-22 14:32 ` [PATCH v2 11/22] KVM: MMU: count used shadow pages on preparing path Xiao Guangrong
@ 2011-06-22 14:32 ` Xiao Guangrong
  2011-06-22 14:33 ` [PATCH v2 13/22] KVM: MMU: remove bypass_guest_pf Xiao Guangrong
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Split kvm_mmu_free_page into kvm_mmu_isolate_page and
kvm_mmu_free_page

The former removes the page from the shadow page cache and must run under the
mmu lock; the latter frees the page table itself and can run outside the mmu
lock.
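
For illustration only, a stand-alone sketch of the two-phase teardown this
split is aiming at, with a pthread mutex standing in for kvm->mmu_lock and
deliberately simplified types. In this patch both halves are still called back
to back under the lock (see kvm_mmu_commit_zap_page in the diff); a later
patch in the series moves the free half behind RCU so that it really runs
outside the lock.

#include <pthread.h>
#include <stdlib.h>

/* toy stand-ins: 'hashed' models the sp being findable in the hash list */
struct shadow_page { void *spt; int hashed; };

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;

static void isolate_page(struct shadow_page *sp)
{
	sp->hashed = 0;		/* hlist_del() in the real code: no longer findable */
}

static void free_page_table(struct shadow_page *sp)
{
	free(sp->spt);		/* the expensive part, no lock required */
	free(sp);
}

int main(void)
{
	struct shadow_page *sp = calloc(1, sizeof(*sp));

	sp->spt = malloc(4096);
	sp->hashed = 1;

	pthread_mutex_lock(&mmu_lock);
	isolate_page(sp);	/* phase 1: must run under the lock */
	pthread_mutex_unlock(&mmu_lock);

	free_page_table(sp);	/* phase 2: safe outside the lock */
	return 0;
}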

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |   21 ++++++++++++++++++---
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8233aba..c6fabf0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1039,14 +1039,28 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+/*
+ * Remove the sp from shadow page cache, after call it,
+ * we can not find this sp from the cache, and the shadow
+ * page table is still valid.
+ * It should be under the protection of mmu lock.
+ */
+static void kvm_mmu_isolate_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
-	list_del(&sp->link);
-	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
 		free_page((unsigned long)sp->gfns);
+}
+
+/*
+ * Free the shadow page table and the sp, we can do it
+ * out of the protection of mmu lock.
+ */
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+{
+	list_del(&sp->link);
+	free_page((unsigned long)sp->spt);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1678,6 +1692,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	do {
 		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 		WARN_ON(!sp->role.invalid || sp->root_count);
+		kvm_mmu_isolate_page(sp);
 		kvm_mmu_free_page(sp);
 	} while (!list_empty(invalid_list));
 
-- 
1.7.5.4



* [PATCH v2 13/22] KVM: MMU: remove bypass_guest_pf
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (11 preceding siblings ...)
  2011-06-22 14:32 ` [PATCH v2 12/22] KVM: MMU: split kvm_mmu_free_page Xiao Guangrong
@ 2011-06-22 14:33 ` Xiao Guangrong
  2011-06-22 14:33 ` [PATCH v2 14/22] KVM: MMU: filter out the mmio pfn from the fault pfn Xiao Guangrong
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

The idea is from Avi:
| Maybe it's time to kill off bypass_guest_pf=1.  It's not as effective as
| it used to be, since unsync pages always use shadow_trap_nonpresent_pte,
| and since we convert between the two nonpresent_ptes during sync and unsync.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |    4 --
 arch/x86/include/asm/kvm_host.h     |    3 -
 arch/x86/kvm/mmu.c                  |   83 ++++++++++-------------------------
 arch/x86/kvm/mmu_audit.c            |   12 -----
 arch/x86/kvm/paging_tmpl.h          |   51 +++------------------
 arch/x86/kvm/vmx.c                  |   11 +----
 arch/x86/kvm/x86.c                  |    1 -
 7 files changed, 33 insertions(+), 132 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fd248a31..a06e4f1 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1159,10 +1159,6 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			for all guests.
 			Default is 1 (enabled) if in 64bit or 32bit-PAE mode
 
-	kvm-intel.bypass_guest_pf=
-			[KVM,Intel] Disables bypassing of guest page faults
-			on Intel chips. Default is 1 (enabled)
-
 	kvm-intel.ept=	[KVM,Intel] Disable extended page tables
 			(virtualized MMU) support on capable Intel chips.
 			Default is 1 (enabled)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7b0834a..42e577d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -266,8 +266,6 @@ struct kvm_mmu {
 	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
 			    struct x86_exception *exception);
 	gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access);
-	void (*prefetch_page)(struct kvm_vcpu *vcpu,
-			      struct kvm_mmu_page *page);
 	int (*sync_page)(struct kvm_vcpu *vcpu,
 			 struct kvm_mmu_page *sp);
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
@@ -638,7 +636,6 @@ void kvm_mmu_module_exit(void);
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
 int kvm_mmu_setup(struct kvm_vcpu *vcpu);
-void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte);
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c6fabf0..2c7b7a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -186,8 +186,6 @@ static struct kmem_cache *pte_list_desc_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
 
-static u64 __read_mostly shadow_trap_nonpresent_pte;
-static u64 __read_mostly shadow_notrap_nonpresent_pte;
 static u64 __read_mostly shadow_nx_mask;
 static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
@@ -199,13 +197,6 @@ static inline u64 rsvd_bits(int s, int e)
 	return ((1ULL << (e - s + 1)) - 1) << s;
 }
 
-void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
-{
-	shadow_trap_nonpresent_pte = trap_pte;
-	shadow_notrap_nonpresent_pte = notrap_pte;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_set_nonpresent_ptes);
-
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask)
 {
@@ -229,8 +220,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
 
 static int is_shadow_present_pte(u64 pte)
 {
-	return pte != shadow_trap_nonpresent_pte
-		&& pte != shadow_notrap_nonpresent_pte;
+	return pte != 0ull;
 }
 
 static int is_large_pte(u64 pte)
@@ -777,9 +767,9 @@ static int set_spte_track_bits(u64 *sptep, u64 new_spte)
 	return 1;
 }
 
-static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
+static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
-	if (set_spte_track_bits(sptep, new_spte))
+	if (set_spte_track_bits(sptep, 0ull))
 		rmap_remove(kvm, sptep);
 }
 
@@ -814,8 +804,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 			BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
 			pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
 			if (is_writable_pte(*spte)) {
-				drop_spte(kvm, spte,
-					  shadow_trap_nonpresent_pte);
+				drop_spte(kvm, spte);
 				--kvm->stat.lpages;
 				spte = NULL;
 				write_protected = 1;
@@ -836,7 +825,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	while ((spte = rmap_next(kvm, rmapp, NULL))) {
 		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
-		drop_spte(kvm, spte, shadow_trap_nonpresent_pte);
+		drop_spte(kvm, spte);
 		need_tlb_flush = 1;
 	}
 	return need_tlb_flush;
@@ -858,7 +847,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 		rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
 		need_flush = 1;
 		if (pte_write(*ptep)) {
-			drop_spte(kvm, spte, shadow_trap_nonpresent_pte);
+			drop_spte(kvm, spte);
 			spte = rmap_next(kvm, rmapp, NULL);
 		} else {
 			new_spte = *spte &~ (PT64_BASE_ADDR_MASK);
@@ -1088,7 +1077,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 			    u64 *parent_pte)
 {
 	mmu_page_remove_parent_pte(sp, parent_pte);
-	__set_spte(parent_pte, shadow_trap_nonpresent_pte);
+	__set_spte(parent_pte, 0ull);
 }
 
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
@@ -1130,15 +1119,6 @@ static void mark_unsync(u64 *spte)
 	kvm_mmu_mark_parents_unsync(sp);
 }
 
-static void nonpaging_prefetch_page(struct kvm_vcpu *vcpu,
-				    struct kvm_mmu_page *sp)
-{
-	int i;
-
-	for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
-		sp->spt[i] = shadow_trap_nonpresent_pte;
-}
-
 static int nonpaging_sync_page(struct kvm_vcpu *vcpu,
 			       struct kvm_mmu_page *sp)
 {
@@ -1420,6 +1400,14 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	}
 }
 
+static void init_shadow_page_table(struct kvm_mmu_page *sp)
+{
+	int i;
+
+	for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
+		sp->spt[i] = 0ull;
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -1482,10 +1470,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 		account_shadowed(vcpu->kvm, gfn);
 	}
-	if (shadow_trap_nonpresent_pte != shadow_notrap_nonpresent_pte)
-		vcpu->arch.mmu.prefetch_page(vcpu, sp);
-	else
-		nonpaging_prefetch_page(vcpu, sp);
+	init_shadow_page_table(sp);
 	trace_kvm_mmu_get_page(sp, true);
 	return sp;
 }
@@ -1546,7 +1531,7 @@ static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 {
 	if (is_large_pte(*sptep)) {
-		drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
+		drop_spte(vcpu->kvm, sptep);
 		kvm_flush_remote_tlbs(vcpu->kvm);
 	}
 }
@@ -1582,13 +1567,13 @@ static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	pte = *spte;
 	if (is_shadow_present_pte(pte)) {
 		if (is_last_spte(pte, sp->role.level))
-			drop_spte(kvm, spte, shadow_trap_nonpresent_pte);
+			drop_spte(kvm, spte);
 		else {
 			child = page_header(pte & PT64_BASE_ADDR_MASK);
 			drop_parent_pte(child, spte);
 		}
 	}
-	__set_spte(spte, shadow_trap_nonpresent_pte);
+
 	if (is_large_pte(pte))
 		--kvm->stat.lpages;
 }
@@ -1769,20 +1754,6 @@ static void page_header_update_slot(struct kvm *kvm, void *pte, gfn_t gfn)
 	__set_bit(slot, sp->slot_bitmap);
 }
 
-static void mmu_convert_notrap(struct kvm_mmu_page *sp)
-{
-	int i;
-	u64 *pt = sp->spt;
-
-	if (shadow_trap_nonpresent_pte == shadow_notrap_nonpresent_pte)
-		return;
-
-	for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
-		if (pt[i] == shadow_notrap_nonpresent_pte)
-			__set_spte(&pt[i], shadow_trap_nonpresent_pte);
-	}
-}
-
 /*
  * The function is based on mtrr_type_lookup() in
  * arch/x86/kernel/cpu/mtrr/generic.c
@@ -1895,7 +1866,6 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	sp->unsync = 1;
 
 	kvm_mmu_mark_parents_unsync(sp);
-	mmu_convert_notrap(sp);
 }
 
 static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
@@ -1980,7 +1950,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		if (level > PT_PAGE_TABLE_LEVEL &&
 		    has_wrprotected_page(vcpu->kvm, gfn, level)) {
 			ret = 1;
-			drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
+			drop_spte(vcpu->kvm, sptep);
 			goto done;
 		}
 
@@ -2066,7 +2036,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		} else if (pfn != spte_to_pfn(*sptep)) {
 			pgprintk("hfn old %llx new %llx\n",
 				 spte_to_pfn(*sptep), pfn);
-			drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
+			drop_spte(vcpu->kvm, sptep);
 			kvm_flush_remote_tlbs(vcpu->kvm);
 		} else
 			was_rmapped = 1;
@@ -2162,7 +2132,7 @@ static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
 	spte = sp->spt + i;
 
 	for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
-		if (*spte != shadow_trap_nonpresent_pte || spte == sptep) {
+		if (is_shadow_present_pte(*spte) || spte == sptep) {
 			if (!start)
 				continue;
 			if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0)
@@ -2214,7 +2184,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			break;
 		}
 
-		if (*iterator.sptep == shadow_trap_nonpresent_pte) {
+		if (!is_shadow_present_pte(*iterator.sptep)) {
 			u64 base_addr = iterator.addr;
 
 			base_addr &= PT64_LVL_ADDR_MASK(iterator.level);
@@ -2748,7 +2718,6 @@ static int nonpaging_init_context(struct kvm_vcpu *vcpu,
 	context->page_fault = nonpaging_page_fault;
 	context->gva_to_gpa = nonpaging_gva_to_gpa;
 	context->free = nonpaging_free;
-	context->prefetch_page = nonpaging_prefetch_page;
 	context->sync_page = nonpaging_sync_page;
 	context->invlpg = nonpaging_invlpg;
 	context->update_pte = nonpaging_update_pte;
@@ -2878,7 +2847,6 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu,
 	context->new_cr3 = paging_new_cr3;
 	context->page_fault = paging64_page_fault;
 	context->gva_to_gpa = paging64_gva_to_gpa;
-	context->prefetch_page = paging64_prefetch_page;
 	context->sync_page = paging64_sync_page;
 	context->invlpg = paging64_invlpg;
 	context->update_pte = paging64_update_pte;
@@ -2907,7 +2875,6 @@ static int paging32_init_context(struct kvm_vcpu *vcpu,
 	context->page_fault = paging32_page_fault;
 	context->gva_to_gpa = paging32_gva_to_gpa;
 	context->free = paging_free;
-	context->prefetch_page = paging32_prefetch_page;
 	context->sync_page = paging32_sync_page;
 	context->invlpg = paging32_invlpg;
 	context->update_pte = paging32_update_pte;
@@ -2932,7 +2899,6 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 	context->new_cr3 = nonpaging_new_cr3;
 	context->page_fault = tdp_page_fault;
 	context->free = nonpaging_free;
-	context->prefetch_page = nonpaging_prefetch_page;
 	context->sync_page = nonpaging_sync_page;
 	context->invlpg = nonpaging_invlpg;
 	context->update_pte = nonpaging_update_pte;
@@ -3453,8 +3419,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 				continue;
 
 			if (is_large_pte(pt[i])) {
-				drop_spte(kvm, &pt[i],
-					  shadow_trap_nonpresent_pte);
+				drop_spte(kvm, &pt[i]);
 				--kvm->stat.lpages;
 				continue;
 			}
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 5f6223b..2460a26 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -99,18 +99,6 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 				     "level = %d\n", sp, level);
 			return;
 		}
-
-		if (*sptep == shadow_notrap_nonpresent_pte) {
-			audit_printk(vcpu->kvm, "notrap spte in unsync "
-				     "sp: %p\n", sp);
-			return;
-		}
-	}
-
-	if (sp->role.direct && *sptep == shadow_notrap_nonpresent_pte) {
-		audit_printk(vcpu->kvm, "notrap spte in direct sp: %p\n",
-			     sp);
-		return;
 	}
 
 	if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep, level))
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1e534e0..58a29d0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -343,16 +343,11 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 				    struct kvm_mmu_page *sp, u64 *spte,
 				    pt_element_t gpte)
 {
-	u64 nonpresent = shadow_trap_nonpresent_pte;
-
 	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
 		goto no_present;
 
-	if (!is_present_gpte(gpte)) {
-		if (!sp->unsync)
-			nonpresent = shadow_notrap_nonpresent_pte;
+	if (!is_present_gpte(gpte))
 		goto no_present;
-	}
 
 	if (!(gpte & PT_ACCESSED_MASK))
 		goto no_present;
@@ -360,7 +355,7 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 	return false;
 
 no_present:
-	drop_spte(vcpu->kvm, spte, nonpresent);
+	drop_spte(vcpu->kvm, spte);
 	return true;
 }
 
@@ -443,7 +438,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 		if (spte == sptep)
 			continue;
 
-		if (*spte != shadow_trap_nonpresent_pte)
+		if (is_shadow_present_pte(*spte))
 			continue;
 
 		gpte = gptep[i];
@@ -693,11 +688,10 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			if (is_shadow_present_pte(*sptep)) {
 				if (is_large_pte(*sptep))
 					--vcpu->kvm->stat.lpages;
-				drop_spte(vcpu->kvm, sptep,
-					  shadow_trap_nonpresent_pte);
+				drop_spte(vcpu->kvm, sptep);
 				need_flush = 1;
-			} else
-				__set_spte(sptep, shadow_trap_nonpresent_pte);
+			}
+
 			break;
 		}
 
@@ -757,36 +751,6 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 	return gpa;
 }
 
-static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
-				 struct kvm_mmu_page *sp)
-{
-	int i, j, offset, r;
-	pt_element_t pt[256 / sizeof(pt_element_t)];
-	gpa_t pte_gpa;
-
-	if (sp->role.direct
-	    || (PTTYPE == 32 && sp->role.level > PT_PAGE_TABLE_LEVEL)) {
-		nonpaging_prefetch_page(vcpu, sp);
-		return;
-	}
-
-	pte_gpa = gfn_to_gpa(sp->gfn);
-	if (PTTYPE == 32) {
-		offset = sp->role.quadrant << PT64_LEVEL_BITS;
-		pte_gpa += offset * sizeof(pt_element_t);
-	}
-
-	for (i = 0; i < PT64_ENT_PER_PAGE; i += ARRAY_SIZE(pt)) {
-		r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, pt, sizeof pt);
-		pte_gpa += ARRAY_SIZE(pt) * sizeof(pt_element_t);
-		for (j = 0; j < ARRAY_SIZE(pt); ++j)
-			if (r || is_present_gpte(pt[j]))
-				sp->spt[i+j] = shadow_trap_nonpresent_pte;
-			else
-				sp->spt[i+j] = shadow_notrap_nonpresent_pte;
-	}
-}
-
 /*
  * Using the cached information from sp->gfns is safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
@@ -839,8 +803,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		}
 
 		if (gfn != sp->gfns[i]) {
-			drop_spte(vcpu->kvm, &sp->spt[i],
-				      shadow_trap_nonpresent_pte);
+			drop_spte(vcpu->kvm, &sp->spt[i]);
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f5b49c7..a644acb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -49,9 +49,6 @@
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
 
-static int __read_mostly bypass_guest_pf = 1;
-module_param(bypass_guest_pf, bool, S_IRUGO);
-
 static int __read_mostly enable_vpid = 1;
 module_param_named(vpid, enable_vpid, bool, 0444);
 
@@ -3632,8 +3629,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmcs_write32(PLE_WINDOW, ple_window);
 	}
 
-	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, !!bypass_guest_pf);
-	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
 
 	vmcs_write16(HOST_FS_SELECTOR, 0);            /* 22.2.4 */
@@ -7103,16 +7100,12 @@ static int __init vmx_init(void)
 	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
 
 	if (enable_ept) {
-		bypass_guest_pf = 0;
 		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
 				VMX_EPT_EXECUTABLE_MASK);
 		kvm_enable_tdp();
 	} else
 		kvm_disable_tdp();
 
-	if (bypass_guest_pf)
-		kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull);
-
 	return 0;
 
 out3:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 40ffbc5..78d5519 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5065,7 +5065,6 @@ int kvm_arch_init(void *opaque)
 	kvm_init_msr_list();
 
 	kvm_x86_ops = ops;
-	kvm_mmu_set_nonpresent_ptes(0ull, 0ull);
 	kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK,
 			PT_DIRTY_MASK, PT64_NX_MASK, 0);
 
-- 
1.7.5.4



* [PATCH v2 14/22] KVM: MMU: filter out the mmio pfn from the fault pfn
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (12 preceding siblings ...)
  2011-06-22 14:33 ` [PATCH v2 13/22] KVM: MMU: remove bypass_guest_pf Xiao Guangrong
@ 2011-06-22 14:33 ` Xiao Guangrong
  2011-06-22 14:34 ` [PATCH v2 15/22] KVM: MMU: abstract some functions to handle " Xiao Guangrong
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

If the page fault is caused by mmio, the gfn cannot be found in the memslots
and 'bad_pfn' is returned on the gfn_to_hva path, so we can use 'bad_pfn' to
identify an mmio page fault.
Also, to keep the meaning of an mmio pfn unambiguous, return the fault page
instead of the bad page when the gfn is not allowed to be prefetched.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c       |    4 ++--
 include/linux/kvm_host.h |    5 +++++
 virt/kvm/kvm_main.c      |   16 ++++++++++++++--
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2c7b7a0..4a51042 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2085,8 +2085,8 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, no_dirty_log);
 	if (!slot) {
-		get_page(bad_page);
-		return page_to_pfn(bad_page);
+		get_page(fault_page);
+		return page_to_pfn(fault_page);
 	}
 
 	hva = gfn_to_hva_memslot(slot, gfn);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..3e548e8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -326,12 +326,17 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
 
 extern struct page *bad_page;
+extern struct page *fault_page;
+
 extern pfn_t bad_pfn;
+extern pfn_t fault_pfn;
 
 int is_error_page(struct page *page);
 int is_error_pfn(pfn_t pfn);
 int is_hwpoison_pfn(pfn_t pfn);
 int is_fault_pfn(pfn_t pfn);
+int is_mmio_pfn(pfn_t pfn);
+int is_invalid_pfn(pfn_t pfn);
 int kvm_is_error_hva(unsigned long addr);
 int kvm_set_memory_region(struct kvm *kvm,
 			  struct kvm_userspace_memory_region *mem,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11d2783..c7d41d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -101,8 +101,8 @@ static bool largepages_enabled = true;
 static struct page *hwpoison_page;
 static pfn_t hwpoison_pfn;
 
-static struct page *fault_page;
-static pfn_t fault_pfn;
+struct page *fault_page;
+pfn_t fault_pfn;
 
 inline int kvm_is_mmio_pfn(pfn_t pfn)
 {
@@ -931,6 +931,18 @@ int is_fault_pfn(pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(is_fault_pfn);
 
+int is_mmio_pfn(pfn_t pfn)
+{
+	return pfn == bad_pfn;
+}
+EXPORT_SYMBOL_GPL(is_mmio_pfn);
+
+int is_invalid_pfn(pfn_t pfn)
+{
+	return pfn == hwpoison_pfn || pfn == fault_pfn;
+}
+EXPORT_SYMBOL_GPL(is_invalid_pfn);
+
 static inline unsigned long bad_hva(void)
 {
 	return PAGE_OFFSET;
-- 
1.7.5.4



* [PATCH v2 15/22] KVM: MMU: abstract some functions to handle fault pfn
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (13 preceding siblings ...)
  2011-06-22 14:33 ` [PATCH v2 14/22] KVM: MMU: filter out the mmio pfn from the fault pfn Xiao Guangrong
@ 2011-06-22 14:34 ` Xiao Guangrong
  2011-06-22 14:34 ` [PATCH v2 16/22] KVM: MMU: introduce the rules to modify shadow page table Xiao Guangrong
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Introduce handle_abnormal_pfn to handle a fault pfn on the page fault path,
and mmu_invalid_pfn to handle a fault pfn on the prefetch path.

This is preparatory work for mmio page fault support.
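
A stand-alone sketch (sentinel values and predicate bodies are stubs, not the
kernel ones) of the split being introduced: the prefetch path only needs a
yes/no answer so it can silently skip a gpte, while the page fault path needs
a precise return value, which is why handle_abnormal_pfn reports its result
through *ret_val.

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long pfn_t;

/* stubbed predicates; the real ones come from the previous patch */
static bool is_invalid_pfn(pfn_t pfn) { return pfn == (pfn_t)-2; }
static bool is_mmio_pfn(pfn_t pfn)    { return pfn == (pfn_t)-1; }

/* prefetch path: any abnormal pfn just means "skip this gpte" */
static bool mmu_invalid_pfn(pfn_t pfn)
{
	return is_invalid_pfn(pfn) || is_mmio_pfn(pfn);
}

/* page fault path: abnormal pfns need a precise return value */
static bool handle_abnormal_pfn(pfn_t pfn, int *ret_val)
{
	if (is_invalid_pfn(pfn)) {
		*ret_val = -14;		/* report the error, -EFAULT in the kernel */
		return true;
	}
	if (is_mmio_pfn(pfn)) {
		*ret_val = 1;		/* cache mmio info and let the caller emulate */
		return true;
	}
	return false;			/* normal pfn: keep going and map it */
}

int main(void)
{
	int r;

	printf("prefetch skips abnormal pfn: %d\n", mmu_invalid_pfn((pfn_t)-1));
	if (handle_abnormal_pfn((pfn_t)-1, &r))
		printf("page fault path returns %d\n", r);
	return 0;
}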

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |   47 ++++++++++++++++++++++++++++++++-----------
 arch/x86/kvm/paging_tmpl.h |   12 +++++-----
 2 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4a51042..4fe1cb3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2221,18 +2221,15 @@ static void kvm_send_hwpoison_signal(unsigned long address, struct task_struct *
 	send_sig_info(SIGBUS, &info, tsk);
 }
 
-static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gva_t gva,
-			       unsigned access, gfn_t gfn, pfn_t pfn)
+static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, pfn_t pfn)
 {
 	kvm_release_pfn_clean(pfn);
 	if (is_hwpoison_pfn(pfn)) {
 		kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
 		return 0;
-	} else if (is_fault_pfn(pfn))
-		return -EFAULT;
+	}
 
-	vcpu_cache_mmio_info(vcpu, gva, gfn, access);
-	return 1;
+	return -EFAULT;
 }
 
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
@@ -2277,6 +2274,33 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	}
 }
 
+static bool mmu_invalid_pfn(pfn_t pfn)
+{
+	return unlikely(is_invalid_pfn(pfn) || is_mmio_pfn(pfn));
+}
+
+static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
+				pfn_t pfn, unsigned access, int *ret_val)
+{
+	bool ret = true;
+
+	/* The pfn is invalid, report the error! */
+	if (unlikely(is_invalid_pfn(pfn))) {
+		*ret_val = kvm_handle_bad_page(vcpu, gfn, pfn);
+		goto exit;
+	}
+
+	if (unlikely(is_mmio_pfn(pfn))) {
+		vcpu_cache_mmio_info(vcpu, gva, gfn, access);
+		*ret_val = 1;
+		goto exit;
+	}
+
+	ret = false;
+exit:
+	return ret;
+}
+
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
 
@@ -2311,9 +2335,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
 	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
 		return 0;
 
-	/* mmio */
-	if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu, v, ACC_ALL, gfn, pfn);
+	if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, &r))
+		return r;
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
@@ -2685,9 +2708,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
 		return 0;
 
-	/* mmio */
-	if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu, 0, 0, gfn, pfn);
+	if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, &r))
+		return r;
+
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 58a29d0..870be69 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -373,7 +373,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte, true);
 	pfn = gfn_to_pfn_atomic(vcpu->kvm, gpte_to_gfn(gpte));
-	if (is_error_pfn(pfn)) {
+	if (mmu_invalid_pfn(pfn)) {
 		kvm_release_pfn_clean(pfn);
 		return;
 	}
@@ -451,7 +451,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 		gfn = gpte_to_gfn(gpte);
 		pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
 				      pte_access & ACC_WRITE_MASK);
-		if (is_error_pfn(pfn)) {
+		if (mmu_invalid_pfn(pfn)) {
 			kvm_release_pfn_clean(pfn);
 			break;
 		}
@@ -621,10 +621,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 			 &map_writable))
 		return 0;
 
-	/* mmio */
-	if (is_error_pfn(pfn))
-		return kvm_handle_bad_page(vcpu, mmu_is_nested(vcpu) ? 0 :
-				      addr, walker.pte_access, walker.gfn, pfn);
+	if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
+				walker.gfn, pfn, walker.pte_access, &r))
+		return r;
+
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
-- 
1.7.5.4



* [PATCH v2 16/22] KVM: MMU: introduce the rules to modify shadow page table
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (14 preceding siblings ...)
  2011-06-22 14:34 ` [PATCH v2 15/22] KVM: MMU: abstract some functions to handle " Xiao Guangrong
@ 2011-06-22 14:34 ` Xiao Guangrong
  2011-06-22 14:34 ` [PATCH v2 17/22] KVM: MMU: clean up spte updating and clearing Xiao Guangrong
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Introduce some interfaces to modify sptes, following what the Linux kernel
does for its own page tables:
- mmu_spte_clear_track_bits: set the spte from present to nonpresent, and
  track the state bits (accessed/dirty) of the spte
- mmu_spte_clear_no_track: the same as mmu_spte_clear_track_bits, except that
  it does not track the state bits
- mmu_spte_set: set the spte from nonpresent to present
- mmu_spte_update: only update the state bits, the mapped pfn is unchanged

Setting an spte directly from present to present is no longer allowed; this
lets us later drop the atomic operations on x86_32 hosts, and it is the
preparatory work for reading sptes on x86_32 hosts outside the mmu lock.
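
The same rule set, restated as a stand-alone lookup so the allowed transitions
are easy to see (helper bodies live in the diff below; treating "last level
vs. upper level" as the discriminator for the two clear variants follows the
comments the patch adds):

#include <stdbool.h>
#include <stdio.h>

/* which helper is legal for a given spte transition (names from the patch) */
static const char *spte_helper(bool old_present, bool new_present, bool last_level)
{
	if (!old_present && new_present)
		return "mmu_spte_set";			/* nonpresent -> present */
	if (old_present && !new_present)
		return last_level ? "mmu_spte_clear_track_bits"	/* track A/D bits  */
				  : "mmu_spte_clear_no_track";	/* upper level spte */
	if (old_present && new_present)
		return "mmu_spte_update";		/* same pfn, new state bits */
	return "nothing to do";				/* nonpresent -> nonpresent */
}

int main(void)
{
	printf("map:    %s\n", spte_helper(false, true, true));
	printf("unmap:  %s\n", spte_helper(true, false, true));
	printf("unlink: %s\n", spte_helper(true, false, false));
	printf("update: %s\n", spte_helper(true, true, true));
	return 0;
}

Changing the pfn of a present spte therefore has to go through a clear
followed by a fresh set, never a direct overwrite.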

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |  103 +++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 69 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4fe1cb3..5ceb64a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -299,12 +299,30 @@ static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
 	return (old_spte & bit_mask) && !(new_spte & bit_mask);
 }
 
-static void update_spte(u64 *sptep, u64 new_spte)
+/* Rules for using mmu_spte_set:
+ * Set the sptep from nonpresent to present.
+ * Note: the sptep being assigned *must* be either not present
+ * or in a state where the hardware will not attempt to update
+ * the spte.
+ */
+static void mmu_spte_set(u64 *sptep, u64 new_spte)
+{
+	WARN_ON(is_shadow_present_pte(*sptep));
+	__set_spte(sptep, new_spte);
+}
+
+/* Rules for using mmu_spte_update:
+ * Update the state bits, it means the mapped pfn is not changged.
+ */
+static void mmu_spte_update(u64 *sptep, u64 new_spte)
 {
 	u64 mask, old_spte = *sptep;
 
 	WARN_ON(!is_rmap_spte(new_spte));
 
+	if (!is_shadow_present_pte(old_spte))
+		return mmu_spte_set(sptep, new_spte);
+
 	new_spte |= old_spte & shadow_dirty_mask;
 
 	mask = shadow_accessed_mask;
@@ -325,6 +343,42 @@ static void update_spte(u64 *sptep, u64 new_spte)
 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
 }
 
+/*
+ * Rules for using mmu_spte_clear_track_bits:
+ * It sets the sptep from present to nonpresent, and track the
+ * state bits, it is used to clear the last level sptep.
+ */
+static int mmu_spte_clear_track_bits(u64 *sptep)
+{
+	pfn_t pfn;
+	u64 old_spte = *sptep;
+
+	if (!spte_has_volatile_bits(old_spte))
+		__set_spte(sptep, 0ull);
+	else
+		old_spte = __xchg_spte(sptep, 0ull);
+
+	if (!is_rmap_spte(old_spte))
+		return 0;
+
+	pfn = spte_to_pfn(old_spte);
+	if (!shadow_accessed_mask || old_spte & shadow_accessed_mask)
+		kvm_set_pfn_accessed(pfn);
+	if (!shadow_dirty_mask || (old_spte & shadow_dirty_mask))
+		kvm_set_pfn_dirty(pfn);
+	return 1;
+}
+
+/*
+ * Rules for using mmu_spte_clear_no_track:
+ * Directly clear spte without caring the state bits of sptep,
+ * it is used to set the upper level spte.
+ */
+static void mmu_spte_clear_no_track(u64 *sptep)
+{
+	__set_spte(sptep, 0ull);
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 				  struct kmem_cache *base_cache, int min)
 {
@@ -746,30 +800,9 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 	pte_list_remove(spte, rmapp);
 }
 
-static int set_spte_track_bits(u64 *sptep, u64 new_spte)
-{
-	pfn_t pfn;
-	u64 old_spte = *sptep;
-
-	if (!spte_has_volatile_bits(old_spte))
-		__set_spte(sptep, new_spte);
-	else
-		old_spte = __xchg_spte(sptep, new_spte);
-
-	if (!is_rmap_spte(old_spte))
-		return 0;
-
-	pfn = spte_to_pfn(old_spte);
-	if (!shadow_accessed_mask || old_spte & shadow_accessed_mask)
-		kvm_set_pfn_accessed(pfn);
-	if (!shadow_dirty_mask || (old_spte & shadow_dirty_mask))
-		kvm_set_pfn_dirty(pfn);
-	return 1;
-}
-
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
-	if (set_spte_track_bits(sptep, 0ull))
+	if (mmu_spte_clear_track_bits(sptep))
 		rmap_remove(kvm, sptep);
 }
 
@@ -787,7 +820,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
 		if (is_writable_pte(*spte)) {
-			update_spte(spte, *spte & ~PT_WRITABLE_MASK);
+			mmu_spte_update(spte, *spte & ~PT_WRITABLE_MASK);
 			write_protected = 1;
 		}
 		spte = rmap_next(kvm, rmapp, spte);
@@ -856,7 +889,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			new_spte &= ~PT_WRITABLE_MASK;
 			new_spte &= ~SPTE_HOST_WRITEABLE;
 			new_spte &= ~shadow_accessed_mask;
-			set_spte_track_bits(spte, new_spte);
+			mmu_spte_clear_track_bits(spte);
+			mmu_spte_set(spte, new_spte);
 			spte = rmap_next(kvm, rmapp, spte);
 		}
 	}
@@ -1077,7 +1111,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 			    u64 *parent_pte)
 {
 	mmu_page_remove_parent_pte(sp, parent_pte);
-	__set_spte(parent_pte, 0ull);
+	mmu_spte_clear_no_track(parent_pte);
 }
 
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
@@ -1525,7 +1559,7 @@ static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
 	spte = __pa(sp->spt)
 		| PT_PRESENT_MASK | PT_ACCESSED_MASK
 		| PT_WRITABLE_MASK | PT_USER_MASK;
-	__set_spte(sptep, spte);
+	mmu_spte_set(sptep, spte);
 }
 
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
@@ -1992,7 +2026,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		mark_page_dirty(vcpu->kvm, gfn);
 
 set_pte:
-	update_spte(sptep, spte);
+	mmu_spte_update(sptep, spte);
 	/*
 	 * If we overwrite a writable spte with a read-only one we
 	 * should flush remote TLBs. Otherwise rmap_write_protect
@@ -2198,11 +2232,11 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 				return -ENOMEM;
 			}
 
-			__set_spte(iterator.sptep,
-				   __pa(sp->spt)
-				   | PT_PRESENT_MASK | PT_WRITABLE_MASK
-				   | shadow_user_mask | shadow_x_mask
-				   | shadow_accessed_mask);
+			mmu_spte_set(iterator.sptep,
+				     __pa(sp->spt)
+				     | PT_PRESENT_MASK | PT_WRITABLE_MASK
+				     | shadow_user_mask | shadow_x_mask
+				     | shadow_accessed_mask);
 		}
 	}
 	return emulate;
@@ -3449,7 +3483,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 
 			/* avoid RMW */
 			if (is_writable_pte(pt[i]))
-				update_spte(&pt[i], pt[i] & ~PT_WRITABLE_MASK);
+				mmu_spte_update(&pt[i],
+						pt[i] & ~PT_WRITABLE_MASK);
 		}
 	}
 	kvm_flush_remote_tlbs(kvm);
-- 
1.7.5.4



* [PATCH v2 17/22] KVM: MMU: clean up spte updating and clearing
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (15 preceding siblings ...)
  2011-06-22 14:34 ` [PATCH v2 16/22] KVM: MMU: introduce the rules to modify shadow page table Xiao Guangrong
@ 2011-06-22 14:34 ` Xiao Guangrong
  2011-06-22 14:35 ` [PATCH 18/22] KVM: MMU: do not need atomically to set/clear spte Xiao Guangrong
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Clean up and share the common code between mmu_spte_clear_* and
mmu_spte_update

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |   75 +++++++++++++++++++++++++++-------------------------
 1 files changed, 39 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5ceb64a..339e1a3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -279,24 +279,51 @@ static u64 __xchg_spte(u64 *sptep, u64 new_spte)
 #endif
 }
 
-static bool spte_has_volatile_bits(u64 spte)
+static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
+{
+	return (old_spte & bit_mask) && !(new_spte & bit_mask);
+}
+
+static bool spte_has_volatile_bits(u64 spte, u64 keep_bits)
 {
+	bool access_nonvolatile = false, dirty_nonvolatile = false;
+
 	if (!shadow_accessed_mask)
 		return false;
 
-	if (!is_shadow_present_pte(spte))
-		return false;
+	if ((spte | keep_bits) & shadow_accessed_mask)
+		access_nonvolatile = true;
+
+	if (!is_writable_pte(spte) || ((spte | keep_bits) & shadow_dirty_mask))
+		dirty_nonvolatile = true;
 
-	if ((spte & shadow_accessed_mask) &&
-	      (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
+	if (access_nonvolatile && dirty_nonvolatile)
 		return false;
 
 	return true;
 }
 
-static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
+static void track_spte_bits(u64 old_spte, u64 keep_bits, bool always_track)
 {
-	return (old_spte & bit_mask) && !(new_spte & bit_mask);
+	if (always_track ||
+	      (spte_is_bit_cleared(old_spte, keep_bits, shadow_accessed_mask)))
+		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+
+	if (always_track ||
+	      (spte_is_bit_cleared(old_spte, keep_bits, shadow_dirty_mask)))
+		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+}
+
+static u64 spte_get_and_update_clear(u64 *sptep, u64 new_spte)
+{
+	u64 old_spte = *sptep;
+
+	if (!spte_has_volatile_bits(old_spte, new_spte))
+		__set_spte(sptep, new_spte);
+	else
+		old_spte = __xchg_spte(sptep, new_spte);
+
+	return old_spte;
 }
 
 /* Rules for using mmu_spte_set:
@@ -316,7 +343,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
  */
 static void mmu_spte_update(u64 *sptep, u64 new_spte)
 {
-	u64 mask, old_spte = *sptep;
+	u64 old_spte = *sptep;
 
 	WARN_ON(!is_rmap_spte(new_spte));
 
@@ -324,23 +351,8 @@ static void mmu_spte_update(u64 *sptep, u64 new_spte)
 		return mmu_spte_set(sptep, new_spte);
 
 	new_spte |= old_spte & shadow_dirty_mask;
-
-	mask = shadow_accessed_mask;
-	if (is_writable_pte(old_spte))
-		mask |= shadow_dirty_mask;
-
-	if (!spte_has_volatile_bits(old_spte) || (new_spte & mask) == mask)
-		__set_spte(sptep, new_spte);
-	else
-		old_spte = __xchg_spte(sptep, new_spte);
-
-	if (!shadow_accessed_mask)
-		return;
-
-	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
-	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+	old_spte = spte_get_and_update_clear(sptep, new_spte);
+	track_spte_bits(old_spte, new_spte, false);
 }
 
 /*
@@ -350,22 +362,13 @@ static void mmu_spte_update(u64 *sptep, u64 new_spte)
  */
 static int mmu_spte_clear_track_bits(u64 *sptep)
 {
-	pfn_t pfn;
 	u64 old_spte = *sptep;
 
-	if (!spte_has_volatile_bits(old_spte))
-		__set_spte(sptep, 0ull);
-	else
-		old_spte = __xchg_spte(sptep, 0ull);
-
 	if (!is_rmap_spte(old_spte))
 		return 0;
 
-	pfn = spte_to_pfn(old_spte);
-	if (!shadow_accessed_mask || old_spte & shadow_accessed_mask)
-		kvm_set_pfn_accessed(pfn);
-	if (!shadow_dirty_mask || (old_spte & shadow_dirty_mask))
-		kvm_set_pfn_dirty(pfn);
+	old_spte = spte_get_and_update_clear(sptep, 0ull);
+	track_spte_bits(old_spte, 0ull, !shadow_accessed_mask);
 	return 1;
 }
 
-- 
1.7.5.4



* [PATCH 18/22] KVM: MMU: do not need atomically to set/clear spte
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (16 preceding siblings ...)
  2011-06-22 14:34 ` [PATCH v2 17/22] KVM: MMU: clean up spte updating and clearing Xiao Guangrong
@ 2011-06-22 14:35 ` Xiao Guangrong
  2011-06-22 14:35 ` [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table Xiao Guangrong
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Now that an spte only ever goes from nonpresent to present or from present to
nonpresent, we can use the same trick as the Linux kernel to set/clear sptes
non-atomically.
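
A stand-alone model of the ordering trick (plain user-space C; smp_wmb() is
modeled with a release fence, the present bit is assumed to live in bit 0 of
the low half, and the function names are simplified): when mapping, the high
half is published before the present bit; when unmapping, the present bit is
cleared before the high half is touched, so a concurrently walking CPU never
sees a half-written present spte.

#include <stdint.h>
#include <stdio.h>

/* one 64-bit spte seen as two 32-bit halves, as in the patch */
union split_spte {
	struct {
		uint32_t spte_low;	/* carries the present bit (bit 0) */
		uint32_t spte_high;
	};
	uint64_t spte;
};

#define wmb()	__atomic_thread_fence(__ATOMIC_RELEASE)	/* models smp_wmb() */

/* nonpresent -> present: publish the high half before the present bit */
static void set_spte(volatile union split_spte *s, uint64_t val)
{
	union split_spte v = { .spte = val };

	s->spte_high = v.spte_high;
	wmb();
	s->spte_low = v.spte_low;
}

/* present -> nonpresent: retract the present bit before clearing the rest */
static void clear_spte(volatile union split_spte *s)
{
	s->spte_low = 0;
	wmb();
	s->spte_high = 0;
}

int main(void)
{
	union split_spte s = { .spte = 0 };

	set_spte(&s, 0x123456789ULL | 1);	/* bit 0 = present */
	printf("spte = %#llx\n", (unsigned long long)s.spte);
	clear_spte(&s);
	printf("spte = %#llx\n", (unsigned long long)s.spte);
	return 0;
}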

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |   82 +++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 339e1a3..0ba94e9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -259,26 +259,82 @@ static gfn_t pse36_gfn_delta(u32 gpte)
 	return (gpte & PT32_DIR_PSE36_MASK) << shift;
 }
 
+#ifdef CONFIG_X86_64
 static void __set_spte(u64 *sptep, u64 spte)
 {
-	set_64bit(sptep, spte);
+	*sptep = spte;
 }
 
-static u64 __xchg_spte(u64 *sptep, u64 new_spte)
+static void __update_clear_spte_fast(u64 *sptep, u64 spte)
 {
-#ifdef CONFIG_X86_64
-	return xchg(sptep, new_spte);
+	*sptep = spte;
+}
+
+static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
+{
+	return xchg(sptep, spte);
+}
 #else
-	u64 old_spte;
+union split_spte {
+	struct {
+		u32 spte_low;
+		u32 spte_high;
+	};
+	u64 spte;
+};
 
-	do {
-		old_spte = *sptep;
-	} while (cmpxchg64(sptep, old_spte, new_spte) != old_spte);
+static void __set_spte(u64 *sptep, u64 spte)
+{
+	union split_spte *ssptep, sspte;
 
-	return old_spte;
-#endif
+	ssptep = (union split_spte *)sptep;
+	sspte = (union split_spte)spte;
+
+	ssptep->spte_high = sspte.spte_high;
+
+	/*
+	 * If we map the spte from nonpresent to present, We should store
+	 * the high bits firstly, then set present bit, so cpu can not
+	 * fetch this spte while we are setting the spte.
+	 */
+	smp_wmb();
+
+	ssptep->spte_low = sspte.spte_low;
 }
 
+static void __update_clear_spte_fast(u64 *sptep, u64 spte)
+{
+	union split_spte *ssptep, sspte;
+
+	ssptep = (union split_spte *)sptep;
+	sspte = (union split_spte)spte;
+
+	ssptep->spte_low = sspte.spte_low;
+
+	/*
+	 * If we map the spte from present to nonpresent, we should clear
+	 * present bit firstly to avoid vcpu fetch the old high bits.
+	 */
+	smp_wmb();
+
+	ssptep->spte_high = sspte.spte_high;
+}
+
+static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
+{
+	union split_spte *ssptep, sspte, orig;
+
+	ssptep = (union split_spte *)sptep;
+	sspte = (union split_spte)spte;
+
+	/* xchg acts as a barrier before the setting of the high bits */
+	orig.spte_low = xchg(&ssptep->spte_low, sspte.spte_low);
+	orig.spte_high = ssptep->spte_high = sspte.spte_high;
+
+	return orig.spte;
+}
+#endif
+
 static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
 {
 	return (old_spte & bit_mask) && !(new_spte & bit_mask);
@@ -319,9 +375,9 @@ static u64 spte_get_and_update_clear(u64 *sptep, u64 new_spte)
 	u64 old_spte = *sptep;
 
 	if (!spte_has_volatile_bits(old_spte, new_spte))
-		__set_spte(sptep, new_spte);
+		__update_clear_spte_fast(sptep, new_spte);
 	else
-		old_spte = __xchg_spte(sptep, new_spte);
+		old_spte = __update_clear_spte_slow(sptep, new_spte);
 
 	return old_spte;
 }
@@ -379,7 +435,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
-	__set_spte(sptep, 0ull);
+	__update_clear_spte_fast(sptep, 0ull);
 }
 
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
-- 
1.7.5.4



* [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (17 preceding siblings ...)
  2011-06-22 14:35 ` [PATCH 18/22] KVM: MMU: do not need atomically to set/clear spte Xiao Guangrong
@ 2011-06-22 14:35 ` Xiao Guangrong
  2011-06-29  9:16   ` Avi Kivity
  2011-06-22 14:35 ` [PATCH v2 20/22] KVM: MMU: reorganize struct kvm_shadow_walk_iterator Xiao Guangrong
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Use RCU to protect shadow page tables that are about to be freed, so that we
can walk them safely without holding the mmu lock; this should be fast and is
needed by the mmio page fault support.
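
Two mechanisms make the lockless walk safe: RCU plus the reader_counter keep
the shadow pages from being freed under the walker, and on x86_32 the per-page
clear_spte_count acts like a sequence counter so a torn 64-bit read can be
detected and retried. Below is a stand-alone model of just that second part
(simplified types; smp_rmb() modeled with an acquire fence).

#include <stdint.h>
#include <stdio.h>

union split_spte {
	struct { uint32_t spte_low, spte_high; };
	uint64_t spte;
};

struct shadow_page {
	union split_spte spt[512];
	volatile int clear_spte_count;	/* bumped whenever an spte is cleared */
};

#define rmb()	__atomic_thread_fence(__ATOMIC_ACQUIRE)	/* models smp_rmb() */

/* model of __get_spte_lockless(): retry if a clear raced with our read */
static uint64_t get_spte_lockless(struct shadow_page *sp, int i)
{
	volatile union split_spte *orig = &sp->spt[i];
	union split_spte spte;
	int count;

retry:
	count = sp->clear_spte_count;
	rmb();

	spte.spte_low = orig->spte_low;
	rmb();
	spte.spte_high = orig->spte_high;
	rmb();

	if (spte.spte_low != orig->spte_low || count != sp->clear_spte_count)
		goto retry;

	return spte.spte;
}

int main(void)
{
	static struct shadow_page sp;

	sp.spt[0].spte = 0x8877665544332211ULL;
	printf("spte[0] = %#llx\n",
	       (unsigned long long)get_spte_lockless(&sp, 0));
	return 0;
}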

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    8 +++
 arch/x86/kvm/mmu.c              |  132 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 132 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 42e577d..87a868e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -233,6 +233,12 @@ struct kvm_mmu_page {
 	unsigned int unsync_children;
 	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
 	DECLARE_BITMAP(unsync_child_bitmap, 512);
+
+#ifdef CONFIG_X86_32
+	int clear_spte_count;
+#endif
+
+	struct rcu_head rcu;
 };
 
 struct kvm_pv_mmu_op_buffer {
@@ -477,6 +483,8 @@ struct kvm_arch {
 	u64 hv_guest_os_id;
 	u64 hv_hypercall;
 
+	atomic_t reader_counter;
+
 	#ifdef CONFIG_KVM_MMU_AUDIT
 	int audit_point;
 	#endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0ba94e9..dad7ad9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -182,6 +182,12 @@ struct kvm_shadow_walk_iterator {
 	     shadow_walk_okay(&(_walker));			\
 	     shadow_walk_next(&(_walker)))
 
+#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte)	\
+	for (shadow_walk_init(&(_walker), _vcpu, _addr);		\
+	     shadow_walk_okay(&(_walker)) &&				\
+		({ spte = mmu_spte_get_lockless(_walker.sptep); 1; });	\
+	     __shadow_walk_next(&(_walker), spte))
+
 static struct kmem_cache *pte_list_desc_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
@@ -274,6 +280,11 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
 {
 	return xchg(sptep, spte);
 }
+
+static u64 __get_spte_lockless(u64 *sptep)
+{
+	return ACCESS_ONCE(*sptep);
+}
 #else
 union split_spte {
 	struct {
@@ -283,6 +294,18 @@ union split_spte {
 	u64 spte;
 };
 
+static void count_spte_clear(u64 *sptep, u64 spte)
+{
+	struct kvm_mmu_page *sp =  page_header(__pa(sptep));
+
+	if (is_shadow_present_pte(spte))
+		return;
+
+	/* Ensure the spte is completely set before we increase the count */
+	smp_wmb();
+	sp->clear_spte_count++;
+}
+
 static void __set_spte(u64 *sptep, u64 spte)
 {
 	union split_spte *ssptep, sspte;
@@ -318,6 +341,7 @@ static void __update_clear_spte_fast(u64 *sptep, u64 spte)
 	smp_wmb();
 
 	ssptep->spte_high = sspte.spte_high;
+	count_spte_clear(sptep, spte);
 }
 
 static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
@@ -330,9 +354,40 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
 	/* xchg acts as a barrier before the setting of the high bits */
 	orig.spte_low = xchg(&ssptep->spte_low, sspte.spte_low);
 	orig.spte_high = ssptep->spte_high = sspte.spte_high;
+	count_spte_clear(sptep, spte);
 
 	return orig.spte;
 }
+
+/*
+ * The idea using the light way get the spte on x86_32 guest is from
+ * gup_get_pte(arch/x86/mm/gup.c).
+ * The difference is we can not catch the spte tlb flush if we leave
+ * guest mode, so we emulate it by increase clear_spte_count when spte
+ * is cleared.
+ */
+static u64 __get_spte_lockless(u64 *sptep)
+{
+	struct kvm_mmu_page *sp =  page_header(__pa(sptep));
+	union split_spte spte, *orig = (union split_spte *)sptep;
+	int count;
+
+retry:
+	count = sp->clear_spte_count;
+	smp_rmb();
+
+	spte.spte_low = orig->spte_low;
+	smp_rmb();
+
+	spte.spte_high = orig->spte_high;
+	smp_rmb();
+
+	if (unlikely(spte.spte_low != orig->spte_low ||
+	      count != sp->clear_spte_count))
+		goto retry;
+
+	return spte.spte;
+}
 #endif
 
 static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
@@ -438,6 +493,28 @@ static void mmu_spte_clear_no_track(u64 *sptep)
 	__update_clear_spte_fast(sptep, 0ull);
 }
 
+static u64 mmu_spte_get_lockless(u64 *sptep)
+{
+	return __get_spte_lockless(sptep);
+}
+
+static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
+{
+	rcu_read_lock();
+	atomic_inc(&vcpu->kvm->arch.reader_counter);
+
+	/* Increase the counter before walking shadow page table */
+	smp_mb__after_atomic_inc();
+}
+
+static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
+{
+	/* Decrease the counter after walking shadow page table finished */
+	smp_mb__before_atomic_dec();
+	atomic_dec(&vcpu->kvm->arch.reader_counter);
+	rcu_read_unlock();
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 				  struct kmem_cache *base_cache, int min)
 {
@@ -1600,17 +1677,23 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
 	return true;
 }
 
-static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
+static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
+			       u64 spte)
 {
-	if (is_last_spte(*iterator->sptep, iterator->level)) {
+	if (is_last_spte(spte, iterator->level)) {
 		iterator->level = 0;
 		return;
 	}
 
-	iterator->shadow_addr = *iterator->sptep & PT64_BASE_ADDR_MASK;
+	iterator->shadow_addr = spte & PT64_BASE_ADDR_MASK;
 	--iterator->level;
 }
 
+static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
+{
+	return __shadow_walk_next(iterator, *iterator->sptep);
+}
+
 static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
 {
 	u64 spte;
@@ -1757,6 +1840,30 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	return ret;
 }
 
+static void kvm_mmu_isolate_pages(struct list_head *invalid_list)
+{
+	struct kvm_mmu_page *sp;
+
+	list_for_each_entry(sp, invalid_list, link)
+		kvm_mmu_isolate_page(sp);
+}
+
+static void free_pages_rcu(struct rcu_head *head)
+{
+	struct kvm_mmu_page *next, *sp;
+
+	sp = container_of(head, struct kvm_mmu_page, rcu);
+	while (sp) {
+		if (!list_empty(&sp->link))
+			next = list_first_entry(&sp->link,
+				      struct kvm_mmu_page, link);
+		else
+			next = NULL;
+		kvm_mmu_free_page(sp);
+		sp = next;
+	}
+}
+
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 				    struct list_head *invalid_list)
 {
@@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	kvm_flush_remote_tlbs(kvm);
 
+	if (atomic_read(&kvm->arch.reader_counter)) {
+		kvm_mmu_isolate_pages(invalid_list);
+		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
+		list_del_init(invalid_list);
+		call_rcu(&sp->rcu, free_pages_rcu);
+		return;
+	}
+
 	do {
 		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 		WARN_ON(!sp->role.invalid || sp->root_count);
@@ -3797,16 +3912,17 @@ out:
 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
 {
 	struct kvm_shadow_walk_iterator iterator;
+	u64 spte;
 	int nr_sptes = 0;
 
-	spin_lock(&vcpu->kvm->mmu_lock);
-	for_each_shadow_entry(vcpu, addr, iterator) {
-		sptes[iterator.level-1] = *iterator.sptep;
+	walk_shadow_page_lockless_begin(vcpu);
+	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
+		sptes[iterator.level-1] = spte;
 		nr_sptes++;
-		if (!is_shadow_present_pte(*iterator.sptep))
+		if (!is_shadow_present_pte(spte))
 			break;
 	}
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	walk_shadow_page_lockless_end(vcpu);
 
 	return nr_sptes;
 }
-- 
1.7.5.4



* [PATCH v2 20/22] KVM: MMU: reorganize struct kvm_shadow_walk_iterator
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (18 preceding siblings ...)
  2011-06-22 14:35 ` [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table Xiao Guangrong
@ 2011-06-22 14:35 ` Xiao Guangrong
  2011-06-22 14:36 ` [PATCH v2 21/22] KVM: MMU: mmio page fault support Xiao Guangrong
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Reorganize it to make better use of the cache
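
Presumably the win is tighter packing on 64-bit hosts; a rough picture of the
change (the offsets below are an illustration assuming natural alignment, they
are not taken from the patch):

/* before: addr@0 shadow_addr@8 level@16(+pad) sptep@24 index@32(+pad) = 40 bytes */
/* after:  addr@0 shadow_addr@8 sptep@16 level@24 index@28             = 32 bytes */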

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dad7ad9..1319050 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -172,8 +172,8 @@ struct pte_list_desc {
 struct kvm_shadow_walk_iterator {
 	u64 addr;
 	hpa_t shadow_addr;
-	int level;
 	u64 *sptep;
+	int level;
 	unsigned index;
 };
 
-- 
1.7.5.4



* [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (19 preceding siblings ...)
  2011-06-22 14:35 ` [PATCH v2 20/22] KVM: MMU: reorganize struct kvm_shadow_walk_iterator Xiao Guangrong
@ 2011-06-22 14:36 ` Xiao Guangrong
  2011-06-22 21:59   ` Marcelo Tosatti
                     ` (2 more replies)
  2011-06-22 14:36 ` [PATCH v2 22/22] KVM: MMU: trace mmio page fault Xiao Guangrong
  2011-06-29  9:23 ` [PATCH v2 0/22] KVM: optimize for MMIO handled Avi Kivity
  22 siblings, 3 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

The idea is from Avi:

| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)

When a page fault is caused by mmio, we cache the info in the shadow page
table and also set the reserved bits in the shadow page table, so if the mmio
access happens again, we can quickly identify it and emulate it directly

Searching an mmio gfn in the memslots is heavy since we need to walk all
memslots; this feature reduces that cost, and also avoids walking the guest
page table for soft mmu.
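
For readers following the encoding below, here is a rough sketch of what an
mmio spte looks like with the mask chosen in this patch (the picture is an
illustration derived from mark_mmio_spte()/get_mmio_spte_gfn(), not code from
the patch):

/*
 * shadow_mmio_mask = (0xffull << 49) | 1ULL
 *
 *  63     57 56     49 48              12 11      3 2     1 0
 * +---------+---------+------------------+---------+-------+-+
 * | unused  |  0xff   |     gfn << 12    | unused  |access |1|
 * +---------+---------+------------------+---------+-------+-+
 *
 * Bit 0 makes the entry look present, and part of the 0xff range falls
 * into the reserved bits above the host physical address width, so an
 * access through such an spte raises an EPT misconfiguration (or a #PF
 * with PFERR_RSVD_MASK set for soft mmu) instead of hitting a mapping.
 */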

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |  156 ++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu.h         |    1 +
 arch/x86/kvm/paging_tmpl.h |   18 ++++--
 arch/x86/kvm/vmx.c         |   10 +++-
 4 files changed, 173 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1319050..e69a47a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
 static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
+static u64 __read_mostly shadow_mmio_mask = (0xffull << 49 | 1ULL);
+
+static void mmu_spte_set(u64 *sptep, u64 spte);
+
+static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned access)
+{
+	access &= ACC_WRITE_MASK | ACC_USER_MASK;
+
+	mmu_spte_set(sptep, shadow_mmio_mask | access | gfn << PAGE_SHIFT);
+}
+
+static bool is_mmio_spte(u64 spte)
+{
+	return (spte & shadow_mmio_mask) == shadow_mmio_mask;
+}
+
+static gfn_t get_mmio_spte_gfn(u64 spte)
+{
+	return (spte & ~shadow_mmio_mask) >> PAGE_SHIFT;
+}
+
+static unsigned get_mmio_spte_access(u64 spte)
+{
+	return (spte & ~shadow_mmio_mask) & ~PAGE_MASK;
+}
+
+static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
+{
+	if (unlikely(is_mmio_pfn(pfn))) {
+		mark_mmio_spte(sptep, gfn, access);
+		return true;
+	}
+
+	return false;
+}
 
 static inline u64 rsvd_bits(int s, int e)
 {
@@ -226,7 +261,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
 
 static int is_shadow_present_pte(u64 pte)
 {
-	return pte != 0ull;
+	return pte != 0ull && !is_mmio_spte(pte);
 }
 
 static int is_large_pte(u64 pte)
@@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	u64 spte, entry = *sptep;
 	int ret = 0;
 
+	if (set_mmio_spte(sptep, gfn, pfn, pte_access))
+		return 0;
+
 	/*
 	 * We don't set the accessed bit, since we sometimes want to see
 	 * whether the guest actually used the pte (in order to detect
@@ -2258,6 +2296,9 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		kvm_mmu_flush_tlb(vcpu);
 	}
 
+	if (unlikely(is_mmio_spte(*sptep) && emulate))
+		*emulate = 1;
+
 	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
 	pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n",
 		 is_large_pte(*sptep)? "2MB" : "4kB",
@@ -2484,7 +2525,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 
 static bool mmu_invalid_pfn(pfn_t pfn)
 {
-	return unlikely(is_invalid_pfn(pfn) || is_mmio_pfn(pfn));
+	return unlikely(is_invalid_pfn(pfn));
 }
 
 static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
@@ -2498,11 +2539,8 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
 		goto exit;
 	}
 
-	if (unlikely(is_mmio_pfn(pfn))) {
+	if (unlikely(is_mmio_pfn(pfn)))
 		vcpu_cache_mmio_info(vcpu, gva, gfn, access);
-		*ret_val = 1;
-		goto exit;
-	}
 
 	ret = false;
 exit:
@@ -2816,6 +2854,86 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
 	return vcpu->arch.nested_mmu.translate_gpa(vcpu, vaddr, access);
 }
 
+static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct)
+{
+	if (direct)
+		return vcpu_match_mmio_gpa(vcpu, addr);
+
+	return vcpu_match_mmio_gva(vcpu, addr);
+}
+
+static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	u64 spte, ret = 0ull;
+
+	walk_shadow_page_lockless_begin(vcpu);
+	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
+		if (iterator.level == 1) {
+			ret = spte;
+			break;
+		}
+
+		if (!is_shadow_present_pte(spte))
+			break;
+	}
+	walk_shadow_page_lockless_end(vcpu);
+
+	return ret;
+}
+
+/*
+ * If it is a real mmio page fault, return 1 and emulate the instruction
+ * directly, return 0 if it needs the page fault path to fix it, and -1
+ * is returned if a bug is detected.
+ */
+int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
+{
+	u64 spte;
+
+	if (quickly_check_mmio_pf(vcpu, addr, direct))
+		return 1;
+
+	spte = walk_shadow_page_get_mmio_spte(vcpu, addr);
+
+	if (is_mmio_spte(spte)) {
+		gfn_t gfn = get_mmio_spte_gfn(spte);
+		unsigned access = get_mmio_spte_access(spte);
+
+		if (direct)
+			addr = 0;
+		vcpu_cache_mmio_info(vcpu, addr, gfn, access);
+		return 1;
+	}
+
+	/*
+	 * It's ok if the gva is remapped by other cpus on shadow guest,
+	 * it's a BUG if the gfn is not a mmio page.
+	 */
+	if (direct && is_shadow_present_pte(spte))
+		return -1;
+
+	/*
+	 * If the page table is zapped by other cpus, let the page
+	 * fault path to fix it.
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(handle_mmio_page_fault_common);
+
+static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr,
+				  u32 error_code, bool direct)
+{
+	int ret;
+
+	if (likely(!(error_code & PFERR_RSVD_MASK)))
+		return 0;
+
+	ret = handle_mmio_page_fault_common(vcpu, addr, direct);
+	WARN_ON(ret < 0);
+	return ret;
+}
+
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 				u32 error_code, bool prefault)
 {
@@ -2823,6 +2941,11 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 	int r;
 
 	pgprintk("%s: gva %lx error %x\n", __func__, gva, error_code);
+
+	r = handle_mmio_page_fault(vcpu, gva, error_code, true);
+	if (r)
+		return r;
+
 	r = mmu_topup_memory_caches(vcpu);
 	if (r)
 		return r;
@@ -2899,6 +3022,10 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	ASSERT(vcpu);
 	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
 
+	r = handle_mmio_page_fault(vcpu, gpa, error_code, true);
+	if (r)
+		return r;
+
 	r = mmu_topup_memory_caches(vcpu);
 	if (r)
 		return r;
@@ -2996,6 +3123,23 @@ static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
 	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
 }
 
+static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
+			   int *nr_present)
+{
+	if (unlikely(is_mmio_spte(*sptep))) {
+		if (gfn != get_mmio_spte_gfn(*sptep)) {
+			mmu_spte_clear_no_track(sptep);
+			return true;
+		}
+
+		(*nr_present)++;
+		mark_mmio_spte(sptep, gfn, access);
+		return true;
+	}
+
+	return false;
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 05310b1..ebf09bf 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -49,6 +49,7 @@
 #define PFERR_FETCH_MASK (1U << 4)
 
 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
+int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 870be69..cbdc1ae 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -583,6 +583,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 
+	r = handle_mmio_page_fault(vcpu, addr, error_code, mmu_is_nested(vcpu));
+	if (r)
+		return r;
+
 	r = mmu_topup_memory_caches(vcpu);
 	if (r)
 		return r;
@@ -786,7 +790,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		gpa_t pte_gpa;
 		gfn_t gfn;
 
-		if (!is_shadow_present_pte(sp->spt[i]))
+		if (!sp->spt[i])
 			continue;
 
 		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
@@ -795,13 +799,18 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 					  sizeof(pt_element_t)))
 			return -EINVAL;
 
-		gfn = gpte_to_gfn(gpte);
-
 		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
 
+		gfn = gpte_to_gfn(gpte);
+		pte_access = sp->role.access;
+		pte_access &= FNAME(gpte_access)(vcpu, gpte, true);
+
+		if (sync_mmio_spte(&sp->spt[i], gfn, pte_access, &nr_present))
+			continue;
+
 		if (gfn != sp->gfns[i]) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			vcpu->kvm->tlbs_dirty++;
@@ -809,8 +818,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		}
 
 		nr_present++;
-		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte,
-								  true);
+
 		host_writable = sp->spt[i] & SPTE_HOST_WRITEABLE;
 
 		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a644acb..37a49f1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4671,11 +4671,19 @@ static void ept_misconfig_inspect_spte(struct kvm_vcpu *vcpu, u64 spte,
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 {
 	u64 sptes[4];
-	int nr_sptes, i;
+	int nr_sptes, i, ret;
 	gpa_t gpa;
 
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 
+	ret = handle_mmio_page_fault_common(vcpu, gpa, true);
+	if (likely(ret == 1))
+		return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
+					      EMULATE_DONE;
+	if (unlikely(!ret))
+		return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0);
+
+	/* It is the real ept misconfig */
 	printk(KERN_ERR "EPT: Misconfiguration.\n");
 	printk(KERN_ERR "EPT: GPA: 0x%llx\n", gpa);
 
-- 
1.7.5.4



* [PATCH v2 22/22] KVM: MMU: trace mmio page fault
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (20 preceding siblings ...)
  2011-06-22 14:36 ` [PATCH v2 21/22] KVM: MMU: mmio page fault support Xiao Guangrong
@ 2011-06-22 14:36 ` Xiao Guangrong
  2011-06-29  9:23 ` [PATCH v2 0/22] KVM: optimize for MMIO handled Avi Kivity
  22 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-22 14:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

Add tracepoints to trace mmio page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c      |    5 ++++
 arch/x86/kvm/mmutrace.h |   48 +++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/trace.h    |   23 ++++++++++++++++++++++
 arch/x86/kvm/x86.c      |    5 +++-
 4 files changed, 80 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e69a47a..5a4e36c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -205,6 +205,7 @@ static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned access)
 {
 	access &= ACC_WRITE_MASK | ACC_USER_MASK;
 
+	trace_mark_mmio_spte(sptep, gfn, access);
 	mmu_spte_set(sptep, shadow_mmio_mask | access | gfn << PAGE_SHIFT);
 }
 
@@ -1913,6 +1914,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 		kvm_mmu_isolate_pages(invalid_list);
 		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 		list_del_init(invalid_list);
+
+		trace_kvm_mmu_delay_free_pages(sp);
 		call_rcu(&sp->rcu, free_pages_rcu);
 		return;
 	}
@@ -2902,6 +2905,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 
 		if (direct)
 			addr = 0;
+
+		trace_handle_mmio_page_fault(addr, gfn, access);
 		vcpu_cache_mmio_info(vcpu, addr, gfn, access);
 		return 1;
 	}
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index b60b4fd..eed67f3 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -196,6 +196,54 @@ DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_prepare_zap_page,
 	TP_ARGS(sp)
 );
 
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_delay_free_pages,
+	TP_PROTO(struct kvm_mmu_page *sp),
+
+	TP_ARGS(sp)
+);
+
+TRACE_EVENT(
+	mark_mmio_spte,
+	TP_PROTO(u64 *sptep, gfn_t gfn, unsigned access),
+	TP_ARGS(sptep, gfn, access),
+
+	TP_STRUCT__entry(
+		__field(void *, sptep)
+		__field(gfn_t, gfn)
+		__field(unsigned, access)
+	),
+
+	TP_fast_assign(
+		__entry->sptep = sptep;
+		__entry->gfn = gfn;
+		__entry->access = access;
+	),
+
+	TP_printk("sptep:%p gfn %llx access %x", __entry->sptep, __entry->gfn,
+		  __entry->access)
+);
+
+TRACE_EVENT(
+	handle_mmio_page_fault,
+	TP_PROTO(u64 addr, gfn_t gfn, unsigned access),
+	TP_ARGS(addr, gfn, access),
+
+	TP_STRUCT__entry(
+		__field(u64, addr)
+		__field(gfn_t, gfn)
+		__field(unsigned, access)
+	),
+
+	TP_fast_assign(
+		__entry->addr = addr;
+		__entry->gfn = gfn;
+		__entry->access = access;
+	),
+
+	TP_printk("addr:%llx gfn %llx access %x", __entry->addr, __entry->gfn,
+		  __entry->access)
+);
+
 TRACE_EVENT(
 	kvm_mmu_audit,
 	TP_PROTO(struct kvm_vcpu *vcpu, int audit_point),
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 624f8cb..3ff898c 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -698,6 +698,29 @@ TRACE_EVENT(kvm_emulate_insn,
 #define trace_kvm_emulate_insn_start(vcpu) trace_kvm_emulate_insn(vcpu, 0)
 #define trace_kvm_emulate_insn_failed(vcpu) trace_kvm_emulate_insn(vcpu, 1)
 
+TRACE_EVENT(
+	vcpu_match_mmio,
+	TP_PROTO(gva_t gva, gpa_t gpa, bool write, bool gpa_match),
+	TP_ARGS(gva, gpa, write, gpa_match),
+
+	TP_STRUCT__entry(
+		__field(gva_t, gva)
+		__field(gpa_t, gpa)
+		__field(bool, write)
+		__field(bool, gpa_match)
+		),
+
+	TP_fast_assign(
+		__entry->gva = gva;
+		__entry->gpa = gpa;
+		__entry->write = write;
+		__entry->gpa_match = gpa_match
+		),
+
+	TP_printk("gva %#lx gpa %#llx %s %s", __entry->gva, __entry->gpa,
+		  __entry->write ? "Write" : "Read",
+		  __entry->gpa_match ? "GPA" : "GVA")
+);
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 78d5519..24a0f2c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3955,6 +3955,7 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 		  vcpu->arch.access)) {
 		*gpa = vcpu->arch.mmio_gfn << PAGE_SHIFT |
 					(gva & (PAGE_SIZE - 1));
+		trace_vcpu_match_mmio(gva, *gpa, write, false);
 		return 1;
 	}
 
@@ -3970,8 +3971,10 @@ static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 	if ((*gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
 		return 1;
 
-	if (vcpu_match_mmio_gpa(vcpu, *gpa))
+	if (vcpu_match_mmio_gpa(vcpu, *gpa)) {
+		trace_vcpu_match_mmio(gva, *gpa, write, true);
 		return 1;
+	}
 
 	return 0;
 }
-- 
1.7.5.4



* Re: [PATCH v2 01/22] KVM: MMU: fix walking shadow page table
  2011-06-22 14:28 ` [PATCH v2 01/22] KVM: MMU: fix walking shadow page table Xiao Guangrong
@ 2011-06-22 17:13   ` Marcelo Tosatti
  2011-06-23  2:05     ` Xiao Guangrong
  2011-06-27  6:35     ` Xiao Guangrong
  0 siblings, 2 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2011-06-22 17:13 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Wed, Jun 22, 2011 at 10:28:04PM +0800, Xiao Guangrong wrote:
> Properly check the last mapping, and do not walk to the next level if last spte
> is met
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
> ---
>  arch/x86/kvm/mmu.c |    9 +++++----
>  1 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9c629b5..f474e93 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
>  	if (iterator->level < PT_PAGE_TABLE_LEVEL)
>  		return false;
>  
> -	if (iterator->level == PT_PAGE_TABLE_LEVEL)

Change to >= PT_PAGE_TABLE_LEVEL, checks should be in shadow_walk_okay.

> -		if (is_large_pte(*iterator->sptep))
> -			return false;
> -
>  	iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
>  	iterator->sptep	= ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
>  	return true;


* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-22 14:36 ` [PATCH v2 21/22] KVM: MMU: mmio page fault support Xiao Guangrong
@ 2011-06-22 21:59   ` Marcelo Tosatti
  2011-06-23  3:19     ` Xiao Guangrong
  2011-06-27 11:00   ` [PATCH v3 " Xiao Guangrong
  2011-06-29  9:22   ` [PATCH v2 " Avi Kivity
  2 siblings, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2011-06-22 21:59 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM


Xiao,

On Wed, Jun 22, 2011 at 10:36:16PM +0800, Xiao Guangrong wrote:
> The idea is from Avi:
> 
> | We could cache the result of a miss in an spte by using a reserved bit, and
> | checking the page fault error code (or seeing if we get an ept violation or
> | ept misconfiguration), so if we get repeated mmio on a page, we don't need to
> | search the slot list/tree.
> | (https://lkml.org/lkml/2011/2/22/221)
> 
> When the page fault is caused by mmio, we cache the info in the shadow page
> table, and also set the reserved bits in the shadow page table, so if the mmio
> is caused again, we can quickly identify it and emulate it directly
> 
> Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
> can be reduced by this feature, and also avoid walking guest page table for
> soft mmu.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
> ---
>  arch/x86/kvm/mmu.c         |  156 ++++++++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu.h         |    1 +
>  arch/x86/kvm/paging_tmpl.h |   18 ++++--
>  arch/x86/kvm/vmx.c         |   10 +++-
>  4 files changed, 173 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 1319050..e69a47a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
>  static u64 __read_mostly shadow_user_mask;
>  static u64 __read_mostly shadow_accessed_mask;
>  static u64 __read_mostly shadow_dirty_mask;
> +static u64 __read_mostly shadow_mmio_mask = (0xffull << 49 | 1ULL);
> +
> +static void mmu_spte_set(u64 *sptep, u64 spte);
> +
> +static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned access)
> +{
> +	access &= ACC_WRITE_MASK | ACC_USER_MASK;
> +
> +	mmu_spte_set(sptep, shadow_mmio_mask | access | gfn << PAGE_SHIFT);
> +}
> +
> +static bool is_mmio_spte(u64 spte)
> +{
> +	return (spte & shadow_mmio_mask) == shadow_mmio_mask;
> +}
> +
> +static gfn_t get_mmio_spte_gfn(u64 spte)
> +{
> +	return (spte & ~shadow_mmio_mask) >> PAGE_SHIFT;
> +}
> +
> +static unsigned get_mmio_spte_access(u64 spte)
> +{
> +	return (spte & ~shadow_mmio_mask) & ~PAGE_MASK;
> +}
> +
> +static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
> +{
> +	if (unlikely(is_mmio_pfn(pfn))) {
> +		mark_mmio_spte(sptep, gfn, access);
> +		return true;
> +	}
> +
> +	return false;
> +}
>  
>  static inline u64 rsvd_bits(int s, int e)
>  {
> @@ -226,7 +261,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
>  
>  static int is_shadow_present_pte(u64 pte)
>  {
> -	return pte != 0ull;
> +	return pte != 0ull && !is_mmio_spte(pte);
>  }
>  
>  static int is_large_pte(u64 pte)
> @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  	u64 spte, entry = *sptep;
>  	int ret = 0;
>  
> +	if (set_mmio_spte(sptep, gfn, pfn, pte_access))
> +		return 0;
> +


Should zap all mmio sptes when creating new memory slots (current qemu
never exhibits that pattern, but its not an unsupported scenario).
Easier to zap all mmu in that case.

For shadow you need to remove mmio sptes on invlpg and all mmio sptes
under CR3 (mmu_sync_roots), other than clearing the gva/gpa variables.

Otherwise you can move the mmio info from an mmio spte back to
mmio_gva/mmio_gfn after a TLB flush, without rereading the guest
pagetable.

> +{
> +	struct kvm_shadow_walk_iterator iterator;
> +	u64 spte, ret = 0ull;
> +
> +	walk_shadow_page_lockless_begin(vcpu);
> +	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
> +		if (iterator.level == 1) {
> +			ret = spte;
> +			break;
> +		}

Have you verified it is safe to walk sptes with parallel modifications
to them? Other than shadow page deletion which is in theory covered by
RCU. It would be good to have all cases written down.

> +
> +		if (!is_shadow_present_pte(spte))
> +			break;
> +	}
> +	walk_shadow_page_lockless_end(vcpu);
> +
> +	return ret;
> +}
> +
> +/*
> + * If it is a real mmio page fault, return 1 and emulat the instruction
> + * directly, return 0 if it needs page fault path to fix it, -1 is
> + * returned if bug is detected.
> + */
> +int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
> +{
> +	u64 spte;
> +
> +	if (quickly_check_mmio_pf(vcpu, addr, direct))
> +		return 1;
> +
> +	spte = walk_shadow_page_get_mmio_spte(vcpu, addr);
> +
> +	if (is_mmio_spte(spte)) {
> +		gfn_t gfn = get_mmio_spte_gfn(spte);
> +		unsigned access = get_mmio_spte_access(spte);
> +
> +		if (direct)
> +			addr = 0;
> +		vcpu_cache_mmio_info(vcpu, addr, gfn, access);
> +		return 1;
> +	}
> +
> +	/*
> +	 * It's ok if the gva is remapped by other cpus on shadow guest,
> +	 * it's a BUG if the gfn is not a mmio page.
> +	 */

If the gva is remapped there should be no mmio page fault in the first
place.

> +	if (direct && is_shadow_present_pte(spte))
> +		return -1;

An spte does not have to contain the present bit to generate a valid EPT
misconfiguration (and an spte dump is still required in that case).
Use !is_mmio_spte() instead.


> +
> +	/*
> +	 * If the page table is zapped by other cpus, let the page
> +	 * fault path to fix it.
> +	 */
> +	return 0;
> +}

I don't understand when would this happen, can you please explain?

> +EXPORT_SYMBOL_GPL(handle_mmio_page_fault_common);
> +
> +static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr,
> +				  u32 error_code, bool direct)
> +{
> +	int ret;
> +
> +	if (likely(!(error_code & PFERR_RSVD_MASK)))
> +		return 0;
> +
> +	ret = handle_mmio_page_fault_common(vcpu, addr, direct);
> +	WARN_ON(ret < 0);
> +	return ret;
> +}
> +
>  static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
>  				u32 error_code, bool prefault)
>  {
> @@ -2823,6 +2941,11 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
>  	int r;
>  
>  	pgprintk("%s: gva %lx error %x\n", __func__, gva, error_code);
> +
> +	r = handle_mmio_page_fault(vcpu, gva, error_code, true);
> +	if (r)
> +		return r;
> +
>  	r = mmu_topup_memory_caches(vcpu);
>  	if (r)
>  		return r;
> @@ -2899,6 +3022,10 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
>  	ASSERT(vcpu);
>  	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
>  
> +	r = handle_mmio_page_fault(vcpu, gpa, error_code, true);
> +	if (r)
> +		return r;
> +
>  	r = mmu_topup_memory_caches(vcpu);
>  	if (r)
>  		return r;
> @@ -2996,6 +3123,23 @@ static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
>  	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
>  }
>  
> +static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
> +			   int *nr_present)
> +{
> +	if (unlikely(is_mmio_spte(*sptep))) {
> +		if (gfn != get_mmio_spte_gfn(*sptep)) {
> +			mmu_spte_clear_no_track(sptep);
> +			return true;
> +		}
> +
> +		(*nr_present)++;
> +		mark_mmio_spte(sptep, gfn, access);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
>  #define PTTYPE 64
>  #include "paging_tmpl.h"
>  #undef PTTYPE
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 05310b1..ebf09bf 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -49,6 +49,7 @@
>  #define PFERR_FETCH_MASK (1U << 4)
>  
>  int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
> +int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
>  int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
>  
>  static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 870be69..cbdc1ae 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -583,6 +583,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
>  
>  	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
>  
> +	r = handle_mmio_page_fault(vcpu, addr, error_code, mmu_is_nested(vcpu));
> +	if (r)
> +		return r;
> +
>  	r = mmu_topup_memory_caches(vcpu);
>  	if (r)
>  		return r;
> @@ -786,7 +790,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		gpa_t pte_gpa;
>  		gfn_t gfn;
>  
> -		if (!is_shadow_present_pte(sp->spt[i]))
> +		if (!sp->spt[i])
>  			continue;
>  
>  		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
> @@ -795,13 +799,18 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  					  sizeof(pt_element_t)))
>  			return -EINVAL;
>  
> -		gfn = gpte_to_gfn(gpte);
> -
>  		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
>  			vcpu->kvm->tlbs_dirty++;
>  			continue;
>  		}
>  
> +		gfn = gpte_to_gfn(gpte);
> +		pte_access = sp->role.access;
> +		pte_access &= FNAME(gpte_access)(vcpu, gpte, true);
> +
> +		if (sync_mmio_spte(&sp->spt[i], gfn, pte_access, &nr_present))
> +			continue;
> +
>  		if (gfn != sp->gfns[i]) {
>  			drop_spte(vcpu->kvm, &sp->spt[i]);
>  			vcpu->kvm->tlbs_dirty++;
> @@ -809,8 +818,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		}
>  
>  		nr_present++;
> -		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte,
> -								  true);
> +
>  		host_writable = sp->spt[i] & SPTE_HOST_WRITEABLE;
>  
>  		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a644acb..37a49f1 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4671,11 +4671,19 @@ static void ept_misconfig_inspect_spte(struct kvm_vcpu *vcpu, u64 spte,
>  static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>  {
>  	u64 sptes[4];
> -	int nr_sptes, i;
> +	int nr_sptes, i, ret;
>  	gpa_t gpa;
>  
>  	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>  
> +	ret = handle_mmio_page_fault_common(vcpu, gpa, true);
> +	if (likely(ret == 1))
> +		return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
> +					      EMULATE_DONE;
> +	if (unlikely(!ret))
> +		return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0);
> +
> +	/* It is the real ept misconfig */
>  	printk(KERN_ERR "EPT: Misconfiguration.\n");
>  	printk(KERN_ERR "EPT: GPA: 0x%llx\n", gpa);
>  
> -- 
> 1.7.5.4


* Re: [PATCH v2 01/22] KVM: MMU: fix walking shadow page table
  2011-06-22 17:13   ` Marcelo Tosatti
@ 2011-06-23  2:05     ` Xiao Guangrong
  2011-06-27  6:35     ` Xiao Guangrong
  1 sibling, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-23  2:05 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 06/23/2011 01:13 AM, Marcelo Tosatti wrote:
> On Wed, Jun 22, 2011 at 10:28:04PM +0800, Xiao Guangrong wrote:
>> Properly check the last mapping, and do not walk to the next level if last spte
>> is met
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
>> ---
>>  arch/x86/kvm/mmu.c |    9 +++++----
>>  1 files changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 9c629b5..f474e93 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
>>  	if (iterator->level < PT_PAGE_TABLE_LEVEL)
>>  		return false;
>>  
>> -	if (iterator->level == PT_PAGE_TABLE_LEVEL)
> 
> Change to >= PT_PAGE_TABLE_LEVEL, checks should be in shadow_walk_okay.
> 

OK, will fix, thanks!


* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-22 21:59   ` Marcelo Tosatti
@ 2011-06-23  3:19     ` Xiao Guangrong
  2011-06-23  6:40       ` Xiao Guangrong
  2011-06-23 14:21       ` Marcelo Tosatti
  0 siblings, 2 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-23  3:19 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

Marcelo,

Thanks for your review!

On 06/23/2011 05:59 AM, Marcelo Tosatti wrote:

>>  static int is_large_pte(u64 pte)
>> @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>>  	u64 spte, entry = *sptep;
>>  	int ret = 0;
>>  
>> +	if (set_mmio_spte(sptep, gfn, pfn, pte_access))
>> +		return 0;
>> +
> 
> 
> Should zap all mmio sptes when creating new memory slots (current qemu
> never exhibits that pattern, but its not an unsupported scenario).
> Easier to zap all mmu in that case.
> 

OK, will do it in the next version.

> For shadow you need to remove mmio sptes on invlpg and all mmio sptes
> under CR3 (mmu_sync_roots), other than clearing the gva/gpa variables.
> 

Oh, kvm_mmu_pte_write and invlpg do not zap mmio sptes, I will
fix these, thanks for pointing it out.

For mmu_sync_roots, we properly do it in FNAME(sync_page) in this patch:

static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
			   int *nr_present)
{
	if (unlikely(is_mmio_spte(*sptep))) {
		if (gfn != get_mmio_spte_gfn(*sptep)) {
			mmu_spte_clear_no_track(sptep);
			return true;
		}

		(*nr_present)++;
		mark_mmio_spte(sptep, gfn, access);
		return true;
	}

	return false;
}

> Otherwise you can move the mmio info from an mmio spte back to
> mmio_gva/mmio_gfn after a TLB flush, without rereading the guest
> pagetable.
> 

We do not read the guest page table when the mmio page fault occurs,
we just do it as you say:
- Firstly, walk the shadow page table to get the mmio spte
- Then, cache the mmio spte info into mmio_gva/mmio_gfn

Did I miss your meaning?

>> +{
>> +	struct kvm_shadow_walk_iterator iterator;
>> +	u64 spte, ret = 0ull;
>> +
>> +	walk_shadow_page_lockless_begin(vcpu);
>> +	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
>> +		if (iterator.level == 1) {
>> +			ret = spte;
>> +			break;
>> +		}
> 
> Have you verified it is safe to walk sptes with parallel modifications
> to them? Other than shadow page deletion which is in theory covered by
> RCU. It would be good to have all cases written down.
> 

Um, I think it is safe: when the spte is being written, we can get either the
old spte or the new spte, and both cases are ok for us. It is just like the
page structure cache on a real machine: the cpu can fetch the old spte even if
the page structure is changed by other cpus.
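
To make that concrete, this is the pairing the lockless walk relies on, pieced
together from patch 19 earlier in this thread (a sketch of the protocol, not
code from the patches):

/* reader side: vcpu context, no mmu_lock */
walk_shadow_page_lockless_begin(vcpu);	/* rcu_read_lock() + reader_counter++ */
for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
	/* 'spte' is a local snapshot, read once per level */;
walk_shadow_page_lockless_end(vcpu);	/* reader_counter-- + rcu_read_unlock() */

/* writer side: under mmu_lock, e.g. when zapping shadow pages */
kvm_mmu_commit_zap_page(kvm, &invalid_list);
	/* if reader_counter != 0, the zapped pages are freed via call_rcu()
	   instead of immediately, so the reader never touches freed memory */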

>> +
>> +		if (!is_shadow_present_pte(spte))
>> +			break;
>> +	}
>> +	walk_shadow_page_lockless_end(vcpu);
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>> + * If it is a real mmio page fault, return 1 and emulat the instruction
>> + * directly, return 0 if it needs page fault path to fix it, -1 is
>> + * returned if bug is detected.
>> + */
>> +int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
>> +{
>> +	u64 spte;
>> +
>> +	if (quickly_check_mmio_pf(vcpu, addr, direct))
>> +		return 1;
>> +
>> +	spte = walk_shadow_page_get_mmio_spte(vcpu, addr);
>> +
>> +	if (is_mmio_spte(spte)) {
>> +		gfn_t gfn = get_mmio_spte_gfn(spte);
>> +		unsigned access = get_mmio_spte_access(spte);
>> +
>> +		if (direct)
>> +			addr = 0;
>> +		vcpu_cache_mmio_info(vcpu, addr, gfn, access);
>> +		return 1;
>> +	}
>> +
>> +	/*
>> +	 * It's ok if the gva is remapped by other cpus on shadow guest,
>> +	 * it's a BUG if the gfn is not a mmio page.
>> +	 */
> 
> If the gva is remapped there should be no mmio page fault in the first
> place.
> 

For example, consider the following case:

VCPU 0                                     VCPU 1
gva is mapped to a mmio region
                                        access gva
remap gva to other page
                                       emulate mmio access by kvm
                                       (*the point we are discussing*)
flush all tlb

In theory, it can happen.

>> +	if (direct && is_shadow_present_pte(spte))
>> +		return -1;
> 
> An spte does not have to contain the present bit to generate a valid EPT
> misconfiguration (and an spte dump is still required in that case).
> Use !is_mmio_spte() instead.
> 

We can not use !is_mmio_spte() here, since the shadow page can be zapped anytime,
for example:

sp.spt[i] = mmio-spte

          VCPU 0                                    VCPU 1    
Access sp.spte[i], an ept misconfig occurs
                                                   delete sp
                                   (if the number of shadow pages is over the limit,
                                    or page shrinking is required, and other events...)

Walk the shadow page table outside of the lock and get the
non-present spte
(*the point we are discussing*)

So, the bug we can detect is: it is an mmio access but the spte points to a normal
page.

> 
>> +
>> +	/*
>> +	 * If the page table is zapped by other cpus, let the page
>> +	 * fault path to fix it.
>> +	 */
>> +	return 0;
>> +}
> 
> I don't understand when would this happen, can you please explain?
> 

The case is above :-)


* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-23  3:19     ` Xiao Guangrong
@ 2011-06-23  6:40       ` Xiao Guangrong
  2011-06-23 14:21       ` Marcelo Tosatti
  1 sibling, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-23  6:40 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 06/23/2011 11:19 AM, Xiao Guangrong wrote:

>> Otherwise you can move the mmio info from an mmio spte back to
>> mmio_gva/mmio_gfn after a TLB flush, without rereading the guest
>> pagetable.
>>
> 
> We do not read the guest page table when mmio page fault occurred,
> we just do it as you say:
> - Firstly, walk the shadow page table to get the mmio spte
> - Then, cache the mmio spte info to mmio_gva/mmio_gfn
> 
> I missed your meaning?
> 

Marcelo, sorry for my poor English, I missed your meaning, please ignore
this. :(


* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-23  3:19     ` Xiao Guangrong
  2011-06-23  6:40       ` Xiao Guangrong
@ 2011-06-23 14:21       ` Marcelo Tosatti
  2011-06-23 17:55         ` Xiao Guangrong
  1 sibling, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2011-06-23 14:21 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Thu, Jun 23, 2011 at 11:19:26AM +0800, Xiao Guangrong wrote:
> Marcelo,
> 
> Thanks for your review!
> 
> On 06/23/2011 05:59 AM, Marcelo Tosatti wrote:
> 
> >>  static int is_large_pte(u64 pte)
> >> @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> >>  	u64 spte, entry = *sptep;
> >>  	int ret = 0;
> >>  
> >> +	if (set_mmio_spte(sptep, gfn, pfn, pte_access))
> >> +		return 0;
> >> +
> > 
> > 
> > Should zap all mmio sptes when creating new memory slots (current qemu
> > never exhibits that pattern, but its not an unsupported scenario).
> > Easier to zap all mmu in that case.
> > 
> 
> OK, will do it in the next version.
> 
> > For shadow you need to remove mmio sptes on invlpg and all mmio sptes
> > under CR3 (mmu_sync_roots), other than clearing the gva/gpa variables.
> > 
> 
> Oh, kvm_mmu_pte_write and invlpg do not zap mmio spte, i will
> fix these, thanks for your point it out.
> 
> For mmu_sync_roots, we properly do it in FNAME(sync_page) in this patch:
> 
> static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
> 			   int *nr_present)
> {
> 	if (unlikely(is_mmio_spte(*sptep))) {
> 		if (gfn != get_mmio_spte_gfn(*sptep)) {
> 			mmu_spte_clear_no_track(sptep);
> 			return true;
> 		}
> 
> 		(*nr_present)++;
> 		mark_mmio_spte(sptep, gfn, access);
> 		return true;
> 	}
> 
> 	return false;
> }
> 
> > Otherwise you can move the mmio info from an mmio spte back to
> > mmio_gva/mmio_gfn after a TLB flush, without rereading the guest
> > pagetable.
> > 
> 
> We do not read the guest page table when mmio page fault occurred,
> we just do it as you say:
> - Firstly, walk the shadow page table to get the mmio spte
> - Then, cache the mmio spte info to mmio_gva/mmio_gfn
> 
> I missed your meaning?

I meant that guest pagetables must be read again on CR3 switch or invlpg
(GVA->GPA can change in that case), which only happens if mmio sptes are
zapped.

> >> +	struct kvm_shadow_walk_iterator iterator;
> >> +	u64 spte, ret = 0ull;
> >> +
> >> +	walk_shadow_page_lockless_begin(vcpu);
> >> +	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
> >> +		if (iterator.level == 1) {
> >> +			ret = spte;
> >> +			break;
> >> +		}
> > 
> > Have you verified it is safe to walk sptes with parallel modifications
> > to them? Other than shadow page deletion which is in theory covered by
> > RCU. It would be good to have all cases written down.
> > 
> 
> Um, i think it is safe, when spte is being write, we can get the old spte
> or the new spte, both cases are ok for us: it is just like the page structure
> cache on the real machine, the cpu can fetch the old spte even if the page
> structure is changed by other cpus.
> >> +
> >> +		if (!is_shadow_present_pte(spte))
> >> +			break;
> >> +	}
> >> +	walk_shadow_page_lockless_end(vcpu);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +/*
> >> + * If it is a real mmio page fault, return 1 and emulat the instruction
> >> + * directly, return 0 if it needs page fault path to fix it, -1 is
> >> + * returned if bug is detected.
> >> + */
> >> +int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
> >> +{
> >> +	u64 spte;
> >> +
> >> +	if (quickly_check_mmio_pf(vcpu, addr, direct))
> >> +		return 1;
> >> +
> >> +	spte = walk_shadow_page_get_mmio_spte(vcpu, addr);
> >> +
> >> +	if (is_mmio_spte(spte)) {
> >> +		gfn_t gfn = get_mmio_spte_gfn(spte);
> >> +		unsigned access = get_mmio_spte_access(spte);
> >> +
> >> +		if (direct)
> >> +			addr = 0;
> >> +		vcpu_cache_mmio_info(vcpu, addr, gfn, access);
> >> +		return 1;
> >> +	}
> >> +
> >> +	/*
> >> +	 * It's ok if the gva is remapped by other cpus on shadow guest,
> >> +	 * it's a BUG if the gfn is not a mmio page.
> >> +	 */
> > 
> > If the gva is remapped there should be no mmio page fault in the first
> > place.
> > 
> 
> For example, as follow case:
> 
> VCPU 0                                     VCPU 1
> gva is mapped to a mmio region
>                                         access gva
> remap gva to other page
>                                        emulate mmio access by kvm
>                                        (*the point we are discussing*)
> flush all tlb
> 
> In theory, it can happen.
> 
> >> +	if (direct && is_shadow_present_pte(spte))
> >> +		return -1;
> > 
> > An spte does not have to contain the present bit to generate a valid EPT
> > misconfiguration (and an spte dump is still required in that case).
> > Use !is_mmio_spte() instead.
> > 
> 
> We can not use !is_mmio_spte() here, since the shadow page can be zapped anytime,
> for example:
> 
> sp.spt[i] = mmio-spte
> 
>           VCPU 0                                    VCPU 1    
> Access sp.spte[i], ept misconfig is occurred
>                                                    delete sp
>                                    (if the number of shadow page is out of the limit
>                                     or page shrink is required, and other events...)
> 
> Walk shadow page out of the lock and get the
> non-present spte
> (*the point we are discussing*)

Then is_mmio_spte(non-present spte) == false, right? The point is that
only sptes with the precise mmio spte pattern should be considered mmio
sptes; otherwise consider it a genuine EPT misconfiguration error (which
must be reported).

What about using fault code instead of spte as Avi suggested instead?

> So, the bug we can detect is: it is the mmio access but the spte is point to the normal
> page.
> 
> > 
> >> +
> >> +	/*
> >> +	 * If the page table is zapped by other cpus, let the page
> >> +	 * fault path to fix it.
> >> +	 */
> >> +	return 0;
> >> +}
> > 
> > I don't understand when would this happen, can you please explain?
> > 
> 
> The case is above :-)

No need to jump to page fault handler, can let CPU fault again on non
present spte.




* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-23 14:21       ` Marcelo Tosatti
@ 2011-06-23 17:55         ` Xiao Guangrong
  2011-06-23 20:13           ` Marcelo Tosatti
  2011-06-26  8:42           ` Avi Kivity
  0 siblings, 2 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-23 17:55 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 06/23/2011 10:21 PM, Marcelo Tosatti wrote:

>>> An spte does not have to contain the present bit to generate a valid EPT
>>> misconfiguration (and an spte dump is still required in that case).
>>> Use !is_mmio_spte() instead.
>>>
>>
>> We can not use !is_mmio_spte() here, since the shadow page can be zapped anytime,
>> for example:
>>
>> sp.spt[i] = mmio-spte
>>
>>           VCPU 0                                    VCPU 1    
>> Access sp.spte[i], ept misconfig is occurred
>>                                                    delete sp
>>                                    (if the number of shadow page is out of the limit
>>                                     or page shrink is required, and other events...)
>>
>> Walk shadow page out of the lock and get the
>> non-present spte
>> (*the point we are discussing*)
> 
> Then is_mmio_spte(non-present spte) == false, right? Point is that it
> only sptes with precise mmio spte pattern should be considered mmio
> sptes, otherwise consider a genuine EPT misconfiguration error (which
> must be reported).
> 

No, not every non-mmio spte should be considered a genuine EPT misconfig; in
the above case we can get !is_mmio_spte(), but it is not a genuine EPT
misconfig since it is caused by the shadow page being zapped
 
> What about using fault code instead of spte as Avi suggested instead?
> 

Do you mean walking the guest page table to get the mmio gva/mmio gpa for
softmmu instead of walking the shadow page table?

I think it is unsafe, since the guest can change the mapping anytime and we can
get the wrong mmio gva/mmio gpa to emulate; consider the following case:

gva is mapped to the mmio region, we set the reserved bits in the spte:

VCPU 0                                               VCPU 1
Access gva, a reserved-bit page fault occurs
                                             map gva to the RAM region
Walk the guest page table and get the RAM gpa           TLB flush
(*the point we are discussing*)

Then we can get the wrong gpa for mmio emulation, so:
- the VMM may detect an invalid mmio access
- the event is missed: it accesses neither the mmio region nor the RAM region,
  which is not what a real cpu does

Anyway, mmio spte is needed to detect bugs for hard mmu

>> So, the bug we can detect is: it is the mmio access but the spte is point to the normal
>> page.
>>
>>>
>>>> +
>>>> +	/*
>>>> +	 * If the page table is zapped by other cpus, let the page
>>>> +	 * fault path to fix it.
>>>> +	 */
>>>> +	return 0;
>>>> +}
>>>
>>> I don't understand when would this happen, can you please explain?
>>>
>>
>> The case is above :-)
> 
> No need to jump to page fault handler, can let CPU fault again on non
> present spte.
> 

It is a good idea, will do, thanks!
> 



* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-23 17:55         ` Xiao Guangrong
@ 2011-06-23 20:13           ` Marcelo Tosatti
  2011-06-24  2:04             ` Xiao Guangrong
  2011-06-26  8:42           ` Avi Kivity
  1 sibling, 1 reply; 62+ messages in thread
From: Marcelo Tosatti @ 2011-06-23 20:13 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Avi Kivity, LKML, KVM

On Fri, Jun 24, 2011 at 01:55:45AM +0800, Xiao Guangrong wrote:
> On 06/23/2011 10:21 PM, Marcelo Tosatti wrote:
> 
> >>> An spte does not have to contain the present bit to generate a valid EPT
> >>> misconfiguration (and an spte dump is still required in that case).
> >>> Use !is_mmio_spte() instead.
> >>>
> >>
> >> We can not use !is_mmio_spte() here, since the shadow page can be zapped anytime,
> >> for example:
> >>
> >> sp.spt[i] = mmio-spte
> >>
> >>           VCPU 0                                    VCPU 1    
> >> Access sp.spte[i], ept misconfig is occurred
> >>                                                    delete sp
> >>                                    (if the number of shadow page is out of the limit
> >>                                     or page shrink is required, and other events...)
> >>
> >> Walk shadow page out of the lock and get the
> >> non-present spte
> >> (*the point we are discussing*)
> > 
> > Then is_mmio_spte(non-present spte) == false, right? Point is that it
> > only sptes with precise mmio spte pattern should be considered mmio
> > sptes, otherwise consider a genuine EPT misconfiguration error (which
> > must be reported).
> > 
> 
> No, not all no mmio spte is considered a genuine EPT misconfig, as the above
> case, we can get !is_mmio_spte(), but it is not the genuine EPT misconfig
> since it is caused by shadow page zapped

I mean it must be

if (is_mmio_spte(spte))
    handle_mmio
if (spte == spte_not_present) /* race, let CPU refault */
    return
handle EPT misconf

> > What about using fault code instead of spte as Avi suggested instead?
> > 
> 
> Do you mean waking guest page table to get mmio gva/mmio gpa for softmmu instead
> of walking shadow page table?
> 
> I think it is unsafe, since guest can change the mapping anytime, we can get the
> wrong mmio gva/mmio gpa to mmio emulate, consider follow case:
> 
> gva is mapped to the mmio region, we set the reserved bits in the spte:
> 
> VCPU 0                                               VCPU 1
> Access gva, reserved page fault is occurred
>                                              map gva to the RAM region
> Walking guest page table and get the RAM gpa           TLB flush
> (*the point we are discussing*)
> 
> Then we can get the wrong gpa to mmio emulate, so
> - VMM can detected the invalid mmio access
> - the event is missed, it neither accesses the mmio region nor the RAM region,
>   it is not as the real cpu does
> 
> Anyway, mmio spte is needed to detect bugs for hard mmu

Actually I was thinking about EPT misconfig, but there are no other fields
than the GPA.



* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-23 20:13           ` Marcelo Tosatti
@ 2011-06-24  2:04             ` Xiao Guangrong
  0 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-24  2:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 06/24/2011 04:13 AM, Marcelo Tosatti wrote:

>> No, not all no mmio spte is considered a genuine EPT misconfig, as the above
>> case, we can get !is_mmio_spte(), but it is not the genuine EPT misconfig
>> since it is caused by shadow page zapped
> 
> I mean it must be
> 
> if (is_mmio_spte(spte))
>     handle_mmio
> if (spte == spte_not_present) /* race, let CPU refault */
>     return
> handle EPT misconf
> 

The patch already does it as you say :p

if (is_mmio_spte(spte))
	return handle_mmio
if (spte_present(spte))
	return handle EPT misconfig

return page fault path /*I will fix it, let cpu refault instead*/



* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-23 17:55         ` Xiao Guangrong
  2011-06-23 20:13           ` Marcelo Tosatti
@ 2011-06-26  8:42           ` Avi Kivity
  1 sibling, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2011-06-26  8:42 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/23/2011 08:55 PM, Xiao Guangrong wrote:
> Anyway, mmio spte is needed to detect bugs for hard mmu
>

Note, we have the equivalent problem with bypass_guest_pf=1 and it works 
fine.  We did have an issue with EPT due to a missing tlb flush, but we 
don't need to design the whole mmu around it.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH v2 01/22] KVM: MMU: fix walking shadow page table
  2011-06-22 17:13   ` Marcelo Tosatti
  2011-06-23  2:05     ` Xiao Guangrong
@ 2011-06-27  6:35     ` Xiao Guangrong
  1 sibling, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-27  6:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, LKML, KVM

On 06/23/2011 01:13 AM, Marcelo Tosatti wrote:
> On Wed, Jun 22, 2011 at 10:28:04PM +0800, Xiao Guangrong wrote:
>> Properly check the last mapping, and do not walk to the next level if last spte
>> is met
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
>> ---
>>  arch/x86/kvm/mmu.c |    9 +++++----
>>  1 files changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 9c629b5..f474e93 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1517,10 +1517,6 @@ static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
>>  	if (iterator->level < PT_PAGE_TABLE_LEVEL)
>>  		return false;
>>  
>> -	if (iterator->level == PT_PAGE_TABLE_LEVEL)
> 
> Change to >= PT_PAGE_TABLE_LEVEL, checks should be in shadow_walk_okay.
> 

Marcelo,

Sorry, I did not remember: we can not check the last spte in shadow_walk_okay(),
otherwise the last spte is skipped in the loop. :-)
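
To spell it out, the walker macro has roughly this shape (quoted from memory
here, see mmu.c for the real definition):

#define for_each_shadow_entry(_vcpu, _addr, _walker)		\
	for (shadow_walk_init(&(_walker), _vcpu, _addr);	\
	     shadow_walk_okay(&(_walker));			\
	     shadow_walk_next(&(_walker)))

shadow_walk_okay() is evaluated before the loop body, so returning false for
the last spte would hide that spte from the body entirely; the "stop after the
last spte" logic therefore has to live in shadow_walk_next(), as this series
does in __shadow_walk_next().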





* [PATCH v3 21/22] KVM: MMU: mmio page fault support
  2011-06-22 14:36 ` [PATCH v2 21/22] KVM: MMU: mmio page fault support Xiao Guangrong
  2011-06-22 21:59   ` Marcelo Tosatti
@ 2011-06-27 11:00   ` Xiao Guangrong
  2011-06-29  9:22   ` [PATCH v2 " Avi Kivity
  2 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-27 11:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

The idea is from Avi:

| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)

When a page fault is caused by mmio, we cache the info in the shadow page
table and also set the reserved bits in the shadow page table, so if the mmio
access happens again, we can quickly identify it and emulate it directly

Searching an mmio gfn in the memslots is heavy since we need to walk all
memslots; this feature reduces that cost, and also avoids walking the guest
page table for soft mmu.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |  148 +++++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu.h         |    1 +
 arch/x86/kvm/paging_tmpl.h |   21 +++++--
 arch/x86/kvm/vmx.c         |   10 +++-
 arch/x86/kvm/x86.c         |    7 ++
 5 files changed, 173 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 636500b..1198f19 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
 static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
+static u64 __read_mostly shadow_mmio_mask = (0xffull << 49 | 1ULL);
+
+static void mmu_spte_set(u64 *sptep, u64 spte);
+
+static void mark_mmio_spte(u64 *sptep, u64 gfn, unsigned access)
+{
+	access &= ACC_WRITE_MASK | ACC_USER_MASK;
+
+	mmu_spte_set(sptep, shadow_mmio_mask | access | gfn << PAGE_SHIFT);
+}
+
+static bool is_mmio_spte(u64 spte)
+{
+	return (spte & shadow_mmio_mask) == shadow_mmio_mask;
+}
+
+static gfn_t get_mmio_spte_gfn(u64 spte)
+{
+	return (spte & ~shadow_mmio_mask) >> PAGE_SHIFT;
+}
+
+static unsigned get_mmio_spte_access(u64 spte)
+{
+	return (spte & ~shadow_mmio_mask) & ~PAGE_MASK;
+}
+
+static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
+{
+	if (unlikely(is_mmio_pfn(pfn))) {
+		mark_mmio_spte(sptep, gfn, access);
+		return true;
+	}
+
+	return false;
+}
 
 static inline u64 rsvd_bits(int s, int e)
 {
@@ -226,7 +261,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
 
 static int is_shadow_present_pte(u64 pte)
 {
-	return pte != 0ull;
+	return pte != 0ull && !is_mmio_spte(pte);
 }
 
 static int is_large_pte(u64 pte)
@@ -1748,7 +1783,8 @@ static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			child = page_header(pte & PT64_BASE_ADDR_MASK);
 			drop_parent_pte(child, spte);
 		}
-	}
+	} else if (is_mmio_spte(pte))
+		mmu_spte_clear_no_track(spte);
 
 	if (is_large_pte(pte))
 		--kvm->stat.lpages;
@@ -2123,6 +2159,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	u64 spte, entry = *sptep;
 	int ret = 0;
 
+	if (set_mmio_spte(sptep, gfn, pfn, pte_access))
+		return 0;
+
 	/*
 	 * We don't set the accessed bit, since we sometimes want to see
 	 * whether the guest actually used the pte (in order to detect
@@ -2258,6 +2297,9 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		kvm_mmu_flush_tlb(vcpu);
 	}
 
+	if (unlikely(is_mmio_spte(*sptep) && emulate))
+		*emulate = 1;
+
 	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
 	pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n",
 		 is_large_pte(*sptep)? "2MB" : "4kB",
@@ -2484,7 +2526,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 
 static bool mmu_invalid_pfn(pfn_t pfn)
 {
-	return unlikely(is_invalid_pfn(pfn) || is_mmio_pfn(pfn));
+	return unlikely(is_invalid_pfn(pfn));
 }
 
 static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
@@ -2498,11 +2540,8 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
 		goto exit;
 	}
 
-	if (unlikely(is_mmio_pfn(pfn))) {
+	if (unlikely(is_mmio_pfn(pfn)))
 		vcpu_cache_mmio_info(vcpu, gva, gfn, access);
-		*ret_val = 1;
-		goto exit;
-	}
 
 	ret = false;
 exit:
@@ -2816,6 +2855,77 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
 	return vcpu->arch.nested_mmu.translate_gpa(vcpu, vaddr, access);
 }
 
+static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct)
+{
+	if (direct)
+		return vcpu_match_mmio_gpa(vcpu, addr);
+
+	return vcpu_match_mmio_gva(vcpu, addr);
+}
+
+static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	u64 spte = 0ull;
+
+	walk_shadow_page_lockless_begin(vcpu);
+	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
+		if (!is_shadow_present_pte(spte))
+			break;
+	walk_shadow_page_lockless_end(vcpu);
+
+	return spte;
+}
+
+/*
+ * If it is a real mmio page fault, return 1 and emulate the instruction
+ * directly, return 0 to let CPU fault again on the address, -1 is
+ * returned if bug is detected.
+ */
+int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
+{
+	u64 spte;
+
+	if (quickly_check_mmio_pf(vcpu, addr, direct))
+		return 1;
+
+	spte = walk_shadow_page_get_mmio_spte(vcpu, addr);
+
+	if (is_mmio_spte(spte)) {
+		gfn_t gfn = get_mmio_spte_gfn(spte);
+		unsigned access = get_mmio_spte_access(spte);
+
+		if (direct)
+			addr = 0;
+		vcpu_cache_mmio_info(vcpu, addr, gfn, access);
+		return 1;
+	}
+
+	/*
+	 * It's ok if the gva is remapped by other cpus on shadow guest,
+	 * it's a BUG if the gfn is not a mmio page.
+	 */
+	if (direct && is_shadow_present_pte(spte))
+		return -1;
+
+	/*
+	 * If the page table is zapped by other cpus, let CPU fault again on
+	 * the address.
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(handle_mmio_page_fault_common);
+
+static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr,
+				  u32 error_code, bool direct)
+{
+	int ret;
+
+	ret = handle_mmio_page_fault_common(vcpu, addr, direct);
+	WARN_ON(ret < 0);
+	return ret;
+}
+
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 				u32 error_code, bool prefault)
 {
@@ -2823,6 +2933,10 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 	int r;
 
 	pgprintk("%s: gva %lx error %x\n", __func__, gva, error_code);
+
+	if (unlikely(error_code & PFERR_RSVD_MASK))
+		return handle_mmio_page_fault(vcpu, gva, error_code, true);
+
 	r = mmu_topup_memory_caches(vcpu);
 	if (r)
 		return r;
@@ -2899,6 +3013,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	ASSERT(vcpu);
 	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
 
+	if (unlikely(error_code & PFERR_RSVD_MASK))
+		return handle_mmio_page_fault(vcpu, gpa, error_code, true);
+
 	r = mmu_topup_memory_caches(vcpu);
 	if (r)
 		return r;
@@ -2996,6 +3113,23 @@ static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
 	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
 }
 
+static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
+			   int *nr_present)
+{
+	if (unlikely(is_mmio_spte(*sptep))) {
+		if (gfn != get_mmio_spte_gfn(*sptep)) {
+			mmu_spte_clear_no_track(sptep);
+			return true;
+		}
+
+		(*nr_present)++;
+		mark_mmio_spte(sptep, gfn, access);
+		return true;
+	}
+
+	return false;
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 05310b1..ebf09bf 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -49,6 +49,7 @@
 #define PFERR_FETCH_MASK (1U << 4)
 
 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
+int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 870be69..a3bfe3c 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -583,6 +583,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 
+	if (unlikely(error_code & PFERR_RSVD_MASK))
+		return handle_mmio_page_fault(vcpu, addr, error_code,
+					      mmu_is_nested(vcpu));
+
 	r = mmu_topup_memory_caches(vcpu);
 	if (r)
 		return r;
@@ -690,7 +694,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 					--vcpu->kvm->stat.lpages;
 				drop_spte(vcpu->kvm, sptep);
 				need_flush = 1;
-			}
+			} else if (is_mmio_spte(*sptep))
+				mmu_spte_clear_no_track(sptep);
 
 			break;
 		}
@@ -786,7 +791,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		gpa_t pte_gpa;
 		gfn_t gfn;
 
-		if (!is_shadow_present_pte(sp->spt[i]))
+		if (!sp->spt[i])
 			continue;
 
 		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
@@ -795,13 +800,18 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 					  sizeof(pt_element_t)))
 			return -EINVAL;
 
-		gfn = gpte_to_gfn(gpte);
-
 		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
 
+		gfn = gpte_to_gfn(gpte);
+		pte_access = sp->role.access;
+		pte_access &= FNAME(gpte_access)(vcpu, gpte, true);
+
+		if (sync_mmio_spte(&sp->spt[i], gfn, pte_access, &nr_present))
+			continue;
+
 		if (gfn != sp->gfns[i]) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			vcpu->kvm->tlbs_dirty++;
@@ -809,8 +819,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		}
 
 		nr_present++;
-		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte,
-								  true);
+
 		host_writable = sp->spt[i] & SPTE_HOST_WRITEABLE;
 
 		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a644acb..c812ec5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4671,11 +4671,19 @@ static void ept_misconfig_inspect_spte(struct kvm_vcpu *vcpu, u64 spte,
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 {
 	u64 sptes[4];
-	int nr_sptes, i;
+	int nr_sptes, i, ret;
 	gpa_t gpa;
 
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 
+	ret = handle_mmio_page_fault_common(vcpu, gpa, true);
+	if (likely(ret == 1))
+		return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
+					      EMULATE_DONE;
+	if (unlikely(!ret))
+		return 1;
+
+	/* It is the real ept misconfig */
 	printk(KERN_ERR "EPT: Misconfiguration.\n");
 	printk(KERN_ERR "EPT: GPA: 0x%llx\n", gpa);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 78d5519..345a123 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6521,6 +6521,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	if (!kvm->arch.n_requested_mmu_pages)
 		nr_mmu_pages = kvm_mmu_calculate_mmu_pages(kvm);
 
+	/*
+	 * If the new memory slot is created, we need to clear all
+	 * mmio sptes.
+	 */
+	if (npages && old.base_gfn != mem->guest_phys_addr >> PAGE_SHIFT)
+		kvm_arch_flush_shadow(kvm);
+
 	spin_lock(&kvm->mmu_lock);
 	if (nr_mmu_pages)
 		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary
  2011-06-22 14:29 ` [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary Xiao Guangrong
@ 2011-06-29  8:21   ` Avi Kivity
  2011-06-29 10:53     ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29  8:21 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/22/2011 05:29 PM, Xiao Guangrong wrote:
> If the range spans a boundary, the mmio access can be broke, fix it as
> write emulation.
>
> And we already get the guest physical address, so use it to read guest data
> directly to avoid walking guest page table again
>
> Signed-off-by: Xiao Guangrong<xiaoguangrong@cn.fujitsu.com>
> ---
>   arch/x86/kvm/x86.c |   41 ++++++++++++++++++++++++++++++++---------
>   1 files changed, 32 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0b803f0..eb27be4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3944,14 +3944,13 @@ out:
>   }
>   EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
>
> -static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
> -				  unsigned long addr,
> -				  void *val,
> -				  unsigned int bytes,
> -				  struct x86_exception *exception)
> +static int emulator_read_emulated_onepage(unsigned long addr,
> +					  void *val,
> +					  unsigned int bytes,
> +					  struct x86_exception *exception,
> +					  struct kvm_vcpu *vcpu)
>   {
> -	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> -	gpa_t                 gpa;
> +	gpa_t gpa;
>   	int handled;
>
>   	if (vcpu->mmio_read_completed) {
> @@ -3971,8 +3970,7 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
>   	if ((gpa&  PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
>   		goto mmio;
>
> -	if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
> -	    == X86EMUL_CONTINUE)
> +	if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
>   		return X86EMUL_CONTINUE;

This doesn't perform the cpl check.

I suggest dropping this part for now and doing it later.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
  2011-06-22 14:29 ` [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code Xiao Guangrong
@ 2011-06-29  8:24   ` Avi Kivity
  2011-06-29 10:56     ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29  8:24 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/22/2011 05:29 PM, Xiao Guangrong wrote:
> Introduce vcpu_gva_to_gpa to translate the gva to gpa, we can use it
> to cleanup the code between read emulation and write emulation
>
> Signed-off-by: Xiao Guangrong<xiaoguangrong@cn.fujitsu.com>
> ---
>   arch/x86/kvm/x86.c |   38 +++++++++++++++++++++++++++++---------
>   1 files changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eb27be4..c29ef96 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3944,6 +3944,27 @@ out:
>   }
>   EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
>
> +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
> +			   gpa_t *gpa, struct x86_exception *exception,
> +			   bool write)
> +{
> +	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
> +
> +	if (write)
> +		access |= PFERR_WRITE_MASK;

Needs fetch as well so NX/SMEP can work.

> +
> +	*gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation
  2011-06-22 14:30 ` [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation Xiao Guangrong
@ 2011-06-29  8:37   ` Avi Kivity
  2011-06-29 10:59     ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29  8:37 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/22/2011 05:30 PM, Xiao Guangrong wrote:
> The operations of read emulation and write emulation are very similar, so we
> can abstract the operation of them, in larter patch, it is used to cleanup the
> same code
>
> Signed-off-by: Xiao Guangrong<xiaoguangrong@cn.fujitsu.com>
> ---
>   arch/x86/kvm/x86.c |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 72 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c29ef96..887714f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4056,6 +4056,78 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
>   	return 1;
>   }
>
> +struct read_write_emulator_ops {
> +	int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
> +				  int bytes);
> +	int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				  void *val, int bytes);
> +	int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
> +			       int bytes, void *val);
> +	int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				    void *val, int bytes);
> +	bool write;
> +};


Interesting!

This structure combines two unrelated operations, though.  One is the 
internals of the iteration on a virtual address that is split across various 
physical addresses.  The other is the interaction with userspace on mmio 
exits.  They should be split, but I think it's fine to do it in a later 
patch.  This series is long enough already.

I was also annoyed by the duplication.  The way I thought of fixing it 
is having gva_to_gpa() return two gpas, and having the access function 
accept gpa vectors.  The reason was so that we could implement locked 
cross-page operations (which we now emulate as unlocked writes).

But I think we can do without it, and instead emulate locked cross-page 
ops by stalling all other vcpus while we write, or by unmapping the 
pages involved.  It isn't pretty but it doesn't need to be fast since 
it's a very rare operation.  So I think we can go with your approach.
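
Just to make the idea concrete, here is a rough sketch of what the gpa-vector
variant could look like; the struct and helper names are invented purely for
illustration and are not code from this series:

/*
 * Hypothetical sketch: translate a gva access and split it at the page
 * boundary, so the caller can handle a cross-page access as one
 * (possibly locked) operation.
 */
struct gpa_split {
	gpa_t gpa[2];		/* gpa of each piece */
	unsigned int len[2];	/* bytes covered by each piece */
	int nr;			/* 1, or 2 if the access crosses a page */
};

static int vcpu_gva_to_gpa_split(struct kvm_vcpu *vcpu, unsigned long gva,
				 unsigned int bytes, u32 access,
				 struct gpa_split *split,
				 struct x86_exception *exception)
{
	unsigned int first = min_t(unsigned int, bytes,
				   PAGE_SIZE - offset_in_page(gva));

	split->gpa[0] = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access,
							exception);
	if (split->gpa[0] == UNMAPPED_GVA)
		return -EFAULT;
	split->len[0] = first;
	split->nr = 1;

	if (first < bytes) {
		split->gpa[1] = vcpu->arch.walk_mmu->gva_to_gpa(vcpu,
					gva + first, access, exception);
		if (split->gpa[1] == UNMAPPED_GVA)
			return -EFAULT;
		split->len[1] = bytes - first;
		split->nr = 2;
	}

	return 0;
}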

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path
  2011-06-22 14:31 ` [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path Xiao Guangrong
@ 2011-06-29  8:48   ` Avi Kivity
  2011-06-29 11:09     ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29  8:48 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/22/2011 05:31 PM, Xiao Guangrong wrote:
> If the page fault is caused by mmio, we can cache the mmio info, later, we do
> not need to walk guest page table and quickly know it is a mmio fault while we
> emulate the mmio instruction

Does this work if the mmio spans two pages?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-22 14:35 ` [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table Xiao Guangrong
@ 2011-06-29  9:16   ` Avi Kivity
  2011-06-29 11:16     ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29  9:16 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/22/2011 05:35 PM, Xiao Guangrong wrote:
> Use rcu to protect shadow pages table to be freed, so we can safely walk it,
> it should run fastly and is needed by mmio page fault
>

>   static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>   				    struct list_head *invalid_list)
>   {
> @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>
>   	kvm_flush_remote_tlbs(kvm);
>
> +	if (atomic_read(&kvm->arch.reader_counter)) {
> +		kvm_mmu_isolate_pages(invalid_list);
> +		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> +		list_del_init(invalid_list);
> +		call_rcu(&sp->rcu, free_pages_rcu);
> +		return;
> +	}
> +

I think we should do this unconditionally.  The cost of ping-ponging the 
shared cache line containing reader_counter will increase with large smp 
counts.  On the other hand, zap_page is very rare, so it can be a little 
slower.  Also, fewer code paths = easier to understand.
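
In other words, something along these lines, assuming the
kvm_mmu_isolate_pages()/free_pages_rcu() helpers this patch introduces (a
sketch only, not the final code):

static void kvm_mmu_commit_zap_page(struct kvm *kvm,
				    struct list_head *invalid_list)
{
	struct kvm_mmu_page *sp;

	if (list_empty(invalid_list))
		return;

	kvm_flush_remote_tlbs(kvm);

	/* always defer the actual freeing to RCU, no reader_counter check */
	kvm_mmu_isolate_pages(invalid_list);
	sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
	list_del_init(invalid_list);
	call_rcu(&sp->rcu, free_pages_rcu);
}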

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-22 14:36 ` [PATCH v2 21/22] KVM: MMU: mmio page fault support Xiao Guangrong
  2011-06-22 21:59   ` Marcelo Tosatti
  2011-06-27 11:00   ` [PATCH v3 " Xiao Guangrong
@ 2011-06-29  9:22   ` Avi Kivity
  2011-06-29 12:28     ` Xiao Guangrong
  2 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29  9:22 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/22/2011 05:36 PM, Xiao Guangrong wrote:
> The idea is from Avi:
>
> | We could cache the result of a miss in an spte by using a reserved bit, and
> | checking the page fault error code (or seeing if we get an ept violation or
> | ept misconfiguration), so if we get repeated mmio on a page, we don't need to
> | search the slot list/tree.
> | (https://lkml.org/lkml/2011/2/22/221)
>
> When the page fault is caused by mmio, we cache the info in the shadow page
> table, and also set the reserved bits in the shadow page table, so if the mmio
> is caused again, we can quickly identify it and emulate it directly
>
> Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
> can be reduced by this feature, and also avoid walking guest page table for
> soft mmu.
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 1319050..e69a47a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
>   static u64 __read_mostly shadow_user_mask;
>   static u64 __read_mostly shadow_accessed_mask;
>   static u64 __read_mostly shadow_dirty_mask;
> +static u64 __read_mostly shadow_mmio_mask = (0xffull<<  49 | 1ULL);

One bit is shifted out.   And it will fail with 52-bit MAXPHYADDR.

In addition, please set the xwr bits to an invalid pattern on EPT (there 
is an MSR which specifies which patterns are valid; for example, 
execute-only or write-only are invalid).  If all patterns are valid AND 
MAXPHYADDR == 52, then just set the mask to 0 and the optimization 
will be disabled.
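
Purely for illustration, the mask selection at init time could look roughly
like this; the helper name is invented, rsvd_bits() is the helper the mmu
code already uses:

static void kvm_mmu_choose_mmio_mask(void)
{
	int maxphyaddr = boot_cpu_data.x86_phys_bits;

	/*
	 * Place the mask into bits that are reserved for the current
	 * physical address width, so a present mmio spte always triggers
	 * a reserved-bit fault / EPT misconfig.  With MAXPHYADDR == 52
	 * there are no reserved address bits left, so the optimization is
	 * simply disabled (mask == 0), unless EPT allows an invalid xwr
	 * pattern to be used instead.
	 */
	if (maxphyaddr < 52)
		shadow_mmio_mask = rsvd_bits(maxphyaddr, 51) | 1ull;
	else
		shadow_mmio_mask = 0;
}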

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 0/22] KVM: optimize for MMIO handled
  2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
                   ` (21 preceding siblings ...)
  2011-06-22 14:36 ` [PATCH v2 22/22] KVM: MMU: trace mmio page fault Xiao Guangrong
@ 2011-06-29  9:23 ` Avi Kivity
  22 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2011-06-29  9:23 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/22/2011 05:27 PM, Xiao Guangrong wrote:
> In this version, we fix the bugs in the v1:
> - fix broken read emulation spans a page boundary
> - fix invalid spte point is got if we walk shadow page table
>    out of the mmu lock
>
> And, we also introduce some rules to modify spte in this version,
> then it does not need to atomically clear/set spte on x86_32 host
> anymore, the performance report of x86_32 host is in the later section
>
> Avi,
>
> I have sampled the operation of lockless shadow page walking as below steps:
> - mark walk_shadow_page_get_mmio_spte as 'noinline'
> - do the netperf test, the client is on the guest(NIC is e1000) and the server
>    is on the host, it can generate large press of mmio access
> - using perf to sample it, and the result of 'perf report' is attached
>
> The ratio of walk_shadow_page_get_mmio_spte is 0.09%, the ratio of handle_ept_misconfig
> is 0.11%, the ratio of handle_mmio_page_fault_common is 0.07%
>
> I think it is acceptable, your opinion?
>

Yes.

The patchset scares me, but it is nice work!  Good optimization and good 
clean up.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary
  2011-06-29  8:21   ` Avi Kivity
@ 2011-06-29 10:53     ` Xiao Guangrong
  2011-06-29 11:19       ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 10:53 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 04:21 PM, Avi Kivity wrote:

>>
>> -    if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
>> -        == X86EMUL_CONTINUE)
>> +    if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
>>           return X86EMUL_CONTINUE;
> 
> This doesn't perform the cpl check.
> 

Firstly, it calls kvm_mmu_gva_to_gpa_read to translate the gva to a gpa, and
the cpl is checked in that function; is that not enough?
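
For reference, kvm_mmu_gva_to_gpa_read() is (roughly, quoting from memory)
just a thin wrapper that builds the access mask from the current cpl:

gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva,
			      struct x86_exception *exception)
{
	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;

	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
}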

> I suggest dropping this part for now and doing it later.
> 

OK, I will post this part in a separate patchset. :-)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
  2011-06-29  8:24   ` Avi Kivity
@ 2011-06-29 10:56     ` Xiao Guangrong
  2011-06-29 11:09       ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 10:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 04:24 PM, Avi Kivity wrote:

>> +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
>> +               gpa_t *gpa, struct x86_exception *exception,
>> +               bool write)
>> +{
>> +    u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
>> +
>> +    if (write)
>> +        access |= PFERR_WRITE_MASK;
> 
> Needs fetch as well so NX/SMEP can work.
> 

This function is only used by the read/write emulator; execute permission is
not needed for read/write, no?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation
  2011-06-29  8:37   ` Avi Kivity
@ 2011-06-29 10:59     ` Xiao Guangrong
  0 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 10:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 04:37 PM, Avi Kivity wrote:

>> +struct read_write_emulator_ops {
>> +    int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
>> +                  int bytes);
>> +    int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
>> +                  void *val, int bytes);
>> +    int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
>> +                   int bytes, void *val);
>> +    int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
>> +                    void *val, int bytes);
>> +    bool write;
>> +};
> 
> 
> Interesting!
> 
> This structure combines two unrelated operations, though.  One is the internals of the iteration on a virtual address that is split to various physical addresses.  The other is the interaction with userspace on mmio exits.  They should be split, but I think it's fine to do it in a later patch.  This series is long enough already.
> 
> I was also annoyed by the duplication.  They way I thought of fixing it is having gva_to_gpa() return two gpas, and having the access function accept gpa vectors.  The reason was so that we can implemented locked cross-page operations (which we now emulate as unlocked writes).
> 
> But I think we can do without it, and instead emulated locked cross-page ops by stalling all other vcpus while we write, or by unmapping the pages involved.  It isn't pretty but it doesn't need to be fast since it's a very rare operation.  So I think we can go with your approach.
> 

OK, I'll post it in a separate patchset. Thanks, Avi.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path
  2011-06-29  8:48   ` Avi Kivity
@ 2011-06-29 11:09     ` Xiao Guangrong
  2011-06-29 11:10       ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 11:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 04:48 PM, Avi Kivity wrote:
> On 06/22/2011 05:31 PM, Xiao Guangrong wrote:
>> If the page fault is caused by mmio, we can cache the mmio info, later, we do
>> not need to walk guest page table and quickly know it is a mmio fault while we
>> emulate the mmio instruction
> 
> Does this work if the mmio spans two pages?
> 

If the mmio spans two pages, we already split the emulation into two parts,
and the mmio cache info is only matched for one page, so I think it works
well :-)
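
A minimal sketch of the match, assuming the mmio_gva field cached earlier in
the series (this shows the spirit of the code, it is not a verbatim quote):

static bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long gva)
{
	/*
	 * The cache is filled and looked up at page granularity, so a
	 * cross-page access, which is emulated in two parts anyway,
	 * simply does two independent lookups.
	 */
	return vcpu->arch.mmio_gva &&
	       vcpu->arch.mmio_gva == (gva & PAGE_MASK);
}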

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
  2011-06-29 10:56     ` Xiao Guangrong
@ 2011-06-29 11:09       ` Avi Kivity
  2011-06-29 11:26         ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29 11:09 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 01:56 PM, Xiao Guangrong wrote:
> On 06/29/2011 04:24 PM, Avi Kivity wrote:
>
> >>  +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
> >>  +               gpa_t *gpa, struct x86_exception *exception,
> >>  +               bool write)
> >>  +{
> >>  +    u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
> >>  +
> >>  +    if (write)
> >>  +        access |= PFERR_WRITE_MASK;
> >
> >  Needs fetch as well so NX/SMEP can work.
> >
>
> This function is only used by read/write emulator, execute permission is
> not needed for read/write, no?

It's not good to have a function which only implements the functionality 
partially.  It can later be misused.

You can pass the page-fault error code instead of the write parameter; I 
think it will be simpler.
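
A hedged sketch of that suggestion (the prototype and return convention here
are only illustrative, not the final code):

static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
			   gpa_t *gpa, struct x86_exception *exception,
			   u32 pf_error_code)
{
	/* pick up write/fetch from the error code so NX/SMEP work too */
	u32 access = pf_error_code & (PFERR_WRITE_MASK | PFERR_FETCH_MASK);

	if (kvm_x86_ops->get_cpl(vcpu) == 3)
		access |= PFERR_USER_MASK;

	*gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
	if (*gpa == UNMAPPED_GVA)
		return -1;

	return 0;
}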

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path
  2011-06-29 11:09     ` Xiao Guangrong
@ 2011-06-29 11:10       ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2011-06-29 11:10 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 02:09 PM, Xiao Guangrong wrote:
> On 06/29/2011 04:48 PM, Avi Kivity wrote:
> >  On 06/22/2011 05:31 PM, Xiao Guangrong wrote:
> >>  If the page fault is caused by mmio, we can cache the mmio info, later, we do
> >>  not need to walk guest page table and quickly know it is a mmio fault while we
> >>  emulate the mmio instruction
> >
> >  Does this work if the mmio spans two pages?
> >
>
> If the mmio spans two pages, we already split the emulation into two parts,
> and the mmio cache info is only matched for one page, so i thinks it works
> well :-)

Ok, thanks.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29  9:16   ` Avi Kivity
@ 2011-06-29 11:16     ` Xiao Guangrong
  2011-06-29 11:18       ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 11:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 05:16 PM, Avi Kivity wrote:
> On 06/22/2011 05:35 PM, Xiao Guangrong wrote:
>> Use rcu to protect shadow pages table to be freed, so we can safely walk it,
>> it should run fastly and is needed by mmio page fault
>>
> 
>>   static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>>                       struct list_head *invalid_list)
>>   {
>> @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>>
>>       kvm_flush_remote_tlbs(kvm);
>>
>> +    if (atomic_read(&kvm->arch.reader_counter)) {
>> +        kvm_mmu_isolate_pages(invalid_list);
>> +        sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>> +        list_del_init(invalid_list);
>> +        call_rcu(&sp->rcu, free_pages_rcu);
>> +        return;
>> +    }
>> +
> 
> I think we should do this unconditionally.  The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts.  On the other hand, zap_page is very rare, so it can be a little slower.  Also, less code paths = easier to understand.
> 

On soft mmu, zap_page happens very frequently; it causes a performance regression in my test.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29 11:16     ` Xiao Guangrong
@ 2011-06-29 11:18       ` Avi Kivity
  2011-06-29 11:50         ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29 11:18 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 02:16 PM, Xiao Guangrong wrote:
> >>  @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
> >>
> >>        kvm_flush_remote_tlbs(kvm);
> >>
> >>  +    if (atomic_read(&kvm->arch.reader_counter)) {
> >>  +        kvm_mmu_isolate_pages(invalid_list);
> >>  +        sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> >>  +        list_del_init(invalid_list);
> >>  +        call_rcu(&sp->rcu, free_pages_rcu);
> >>  +        return;
> >>  +    }
> >>  +
> >
> >  I think we should do this unconditionally.  The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts.  On the other hand, zap_page is very rare, so it can be a little slower.  Also, less code paths = easier to understand.
> >
>
> On soft mmu, zap_page is very frequently, it can cause performance regression in my test.

Any idea what the cause of the regression is?  It seems to me that 
simply deferring freeing shouldn't have a large impact.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary
  2011-06-29 10:53     ` Xiao Guangrong
@ 2011-06-29 11:19       ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2011-06-29 11:19 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 01:53 PM, Xiao Guangrong wrote:
> On 06/29/2011 04:21 PM, Avi Kivity wrote:
>
> >>
> >>  -    if (kvm_read_guest_virt(ctxt, addr, val, bytes, exception)
> >>  -        == X86EMUL_CONTINUE)
> >>  +    if (!kvm_read_guest(vcpu->kvm, gpa, val, bytes))
> >>            return X86EMUL_CONTINUE;
> >
> >  This doesn't perform the cpl check.
> >
>
> Firstly, it calls kvm_mmu_gva_to_gpa_read to translate gva to gpa, and cpl
> is checked in this function, it is not enough?

You are right, it is enough.  I don't know how I missed it.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
  2011-06-29 11:09       ` Avi Kivity
@ 2011-06-29 11:26         ` Xiao Guangrong
  2011-06-29 11:26           ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 11:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 07:09 PM, Avi Kivity wrote:
> On 06/29/2011 01:56 PM, Xiao Guangrong wrote:
>> On 06/29/2011 04:24 PM, Avi Kivity wrote:
>>
>> >>  +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
>> >>  +               gpa_t *gpa, struct x86_exception *exception,
>> >>  +               bool write)
>> >>  +{
>> >>  +    u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
>> >>  +
>> >>  +    if (write)
>> >>  +        access |= PFERR_WRITE_MASK;
>> >
>> >  Needs fetch as well so NX/SMEP can work.
>> >
>>
>> This function is only used by read/write emulator, execute permission is
>> not needed for read/write, no?
> 
> It's not good to have a function which only implements the functionality partially.  It can later be misused.
> 
> You can pass the page-fault-error-code instead of the write parameter, I think it will be simpler.
> 

Actually, we get the cached mmio info in this function; I think it is pure waste for
accesses other than mmio. What about changing the function name to vcpu_gva_to_gpa_mmio?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
  2011-06-29 11:26         ` Xiao Guangrong
@ 2011-06-29 11:26           ` Avi Kivity
  2011-06-29 11:48             ` Gleb Natapov
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29 11:26 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 02:26 PM, Xiao Guangrong wrote:
> On 06/29/2011 07:09 PM, Avi Kivity wrote:
> >  On 06/29/2011 01:56 PM, Xiao Guangrong wrote:
> >>  On 06/29/2011 04:24 PM, Avi Kivity wrote:
> >>
> >>  >>   +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
> >>  >>   +               gpa_t *gpa, struct x86_exception *exception,
> >>  >>   +               bool write)
> >>  >>   +{
> >>  >>   +    u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
> >>  >>   +
> >>  >>   +    if (write)
> >>  >>   +        access |= PFERR_WRITE_MASK;
> >>  >
> >>  >   Needs fetch as well so NX/SMEP can work.
> >>  >
> >>
> >>  This function is only used by read/write emulator, execute permission is
> >>  not needed for read/write, no?
> >
> >  It's not good to have a function which only implements the functionality partially.  It can later be misused.
> >
> >  You can pass the page-fault-error-code instead of the write parameter, I think it will be simpler.
> >
>
> Actually, we will get the cache mmio info in this function, i think it is pure waste for other
> access execpt mmio, what about change the function name to vcpu_gva_to_gpa_mmio?

Not too happy, but ok.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code
  2011-06-29 11:26           ` Avi Kivity
@ 2011-06-29 11:48             ` Gleb Natapov
  0 siblings, 0 replies; 62+ messages in thread
From: Gleb Natapov @ 2011-06-29 11:48 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Marcelo Tosatti, LKML, KVM

On Wed, Jun 29, 2011 at 02:26:14PM +0300, Avi Kivity wrote:
> On 06/29/2011 02:26 PM, Xiao Guangrong wrote:
> >On 06/29/2011 07:09 PM, Avi Kivity wrote:
> >>  On 06/29/2011 01:56 PM, Xiao Guangrong wrote:
> >>>  On 06/29/2011 04:24 PM, Avi Kivity wrote:
> >>>
> >>>  >>   +static int vcpu_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
> >>>  >>   +               gpa_t *gpa, struct x86_exception *exception,
> >>>  >>   +               bool write)
> >>>  >>   +{
> >>>  >>   +    u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
> >>>  >>   +
> >>>  >>   +    if (write)
> >>>  >>   +        access |= PFERR_WRITE_MASK;
> >>>  >
> >>>  >   Needs fetch as well so NX/SMEP can work.
> >>>  >
> >>>
> >>>  This function is only used by read/write emulator, execute permission is
> >>>  not needed for read/write, no?
> >>
> >>  It's not good to have a function which only implements the functionality partially.  It can later be misused.
> >>
> >>  You can pass the page-fault-error-code instead of the write parameter, I think it will be simpler.
> >>
> >
> >Actually, we will get the cache mmio info in this function, i think it is pure waste for other
> >access execpt mmio, what about change the function name to vcpu_gva_to_gpa_mmio?
> 
> Not too happy, but ok.
> 
I do plan to add fetching from MMIO.

--
			Gleb.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29 11:18       ` Avi Kivity
@ 2011-06-29 11:50         ` Xiao Guangrong
  2011-06-29 12:18           ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 11:50 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 07:18 PM, Avi Kivity wrote:
> On 06/29/2011 02:16 PM, Xiao Guangrong wrote:
>> >>  @@ -1767,6 +1874,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>> >>
>> >>        kvm_flush_remote_tlbs(kvm);
>> >>
>> >>  +    if (atomic_read(&kvm->arch.reader_counter)) {
>> >>  +        kvm_mmu_isolate_pages(invalid_list);
>> >>  +        sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>> >>  +        list_del_init(invalid_list);
>> >>  +        call_rcu(&sp->rcu, free_pages_rcu);
>> >>  +        return;
>> >>  +    }
>> >>  +
>> >
>> >  I think we should do this unconditionally.  The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts.  On the other hand, zap_page is very rare, so it can be a little slower.  Also, less code paths = easier to understand.
>> >
>>
>> On soft mmu, zap_page is very frequently, it can cause performance regression in my test.
> 
> Any idea what the cause of the regression is?  It seems to me that simply deferring freeing shouldn't have a large impact.
> 

I guess it is because pages are freed too frequently; I have done the test, and it shows
about 3219 pages are freed per second.

Kernbench performance comparison:

the original way: 3m27.723
free all shadow pages in rcu context: 3m30.519

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29 11:50         ` Xiao Guangrong
@ 2011-06-29 12:18           ` Avi Kivity
  2011-06-29 12:28             ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29 12:18 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 02:50 PM, Xiao Guangrong wrote:
> >>  >
> >>  >   I think we should do this unconditionally.  The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts.  On the other hand, zap_page is very rare, so it can be a little slower.  Also, less code paths = easier to understand.
> >>  >
> >>
> >>  On soft mmu, zap_page is very frequently, it can cause performance regression in my test.
> >
> >  Any idea what the cause of the regression is?  It seems to me that simply deferring freeing shouldn't have a large impact.
> >
>
> I guess it is because the page is freed too frequently, i have done the test, it shows
> about 3219 pages is freed per second
>
> Kernbench performance comparing:
>
> the origin way: 3m27.723
> free all shadow page in rcu context: 3m30.519

I don't recall seeing such a high free rate.  Who is doing all this zapping?

You may be able to find out with the function tracer + call graph.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29 12:28             ` Xiao Guangrong
@ 2011-06-29 12:27               ` Avi Kivity
  2011-06-29 12:39                 ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29 12:27 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 03:28 PM, Xiao Guangrong wrote:
> On 06/29/2011 08:18 PM, Avi Kivity wrote:
> >  On 06/29/2011 02:50 PM, Xiao Guangrong wrote:
> >>  >>   >
> >>  >>   >    I think we should do this unconditionally.  The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts.  On the other hand, zap_page is very rare, so it can be a little slower.  Also, less code paths = easier to understand.
> >>  >>   >
> >>  >>
> >>  >>   On soft mmu, zap_page is very frequently, it can cause performance regression in my test.
> >>  >
> >>  >   Any idea what the cause of the regression is?  It seems to me that simply deferring freeing shouldn't have a large impact.
> >>  >
> >>
> >>  I guess it is because the page is freed too frequently, i have done the test, it shows
> >>  about 3219 pages is freed per second
> >>
> >>  Kernbench performance comparing:
> >>
> >>  the origin way: 3m27.723
> >>  free all shadow page in rcu context: 3m30.519
> >
> >  I don't recall seeing such a high free rate.  Who is doing all this zapping?
> >
> >  You may be able to find out with the function tracer + call graph.
> >
>
> I looked into it before, it is caused by "write flood" detected, i also noticed
> some pages are zapped and allocation again and again, maybe we need to improve
> the algorithm of detecting "write flood".

Ok.  Let's drop the two paths, and put this improvement on the TODO instead.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29 12:18           ` Avi Kivity
@ 2011-06-29 12:28             ` Xiao Guangrong
  2011-06-29 12:27               ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 12:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 08:18 PM, Avi Kivity wrote:
> On 06/29/2011 02:50 PM, Xiao Guangrong wrote:
>> >>  >
>> >>  >   I think we should do this unconditionally.  The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts.  On the other hand, zap_page is very rare, so it can be a little slower.  Also, less code paths = easier to understand.
>> >>  >
>> >>
>> >>  On soft mmu, zap_page is very frequently, it can cause performance regression in my test.
>> >
>> >  Any idea what the cause of the regression is?  It seems to me that simply deferring freeing shouldn't have a large impact.
>> >
>>
>> I guess it is because the page is freed too frequently, i have done the test, it shows
>> about 3219 pages is freed per second
>>
>> Kernbench performance comparing:
>>
>> the origin way: 3m27.723
>> free all shadow page in rcu context: 3m30.519
> 
> I don't recall seeing such a high free rate.  Who is doing all this zapping?
> 
> You may be able to find out with the function tracer + call graph.
> 

I looked into it before; it is caused by the "write flood" detection. I also noticed
that some pages are zapped and allocated again and again; maybe we need to improve
the algorithm for detecting a "write flood".

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 21/22] KVM: MMU: mmio page fault support
  2011-06-29  9:22   ` [PATCH v2 " Avi Kivity
@ 2011-06-29 12:28     ` Xiao Guangrong
  0 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 12:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 05:22 PM, Avi Kivity wrote:
> On 06/22/2011 05:36 PM, Xiao Guangrong wrote:
>> The idea is from Avi:
>>
>> | We could cache the result of a miss in an spte by using a reserved bit, and
>> | checking the page fault error code (or seeing if we get an ept violation or
>> | ept misconfiguration), so if we get repeated mmio on a page, we don't need to
>> | search the slot list/tree.
>> | (https://lkml.org/lkml/2011/2/22/221)
>>
>> When the page fault is caused by mmio, we cache the info in the shadow page
>> table, and also set the reserved bits in the shadow page table, so if the mmio
>> is caused again, we can quickly identify it and emulate it directly
>>
>> Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
>> can be reduced by this feature, and also avoid walking guest page table for
>> soft mmu.
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 1319050..e69a47a 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -197,6 +197,41 @@ static u64 __read_mostly shadow_x_mask;    /* mutual exclusive with nx_mask */
>>   static u64 __read_mostly shadow_user_mask;
>>   static u64 __read_mostly shadow_accessed_mask;
>>   static u64 __read_mostly shadow_dirty_mask;
>> +static u64 __read_mostly shadow_mmio_mask = (0xffull<<  49 | 1ULL);
> 
> One bit is shifted out.   And it will fail with 52-bit MAXPHYADDR.
> 
> Please in addition, set the xwr bits to an invalid pattern on EPT (there is an MSR which specifies which patterns are valid; for example execute-only or write-only are invalid).  If all patterns are valid AND MAXPHYADDR == 52, then just set the mask to 0 and it the optimization will be disabled.
> 

OK, will fix, thanks!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29 12:27               ` Avi Kivity
@ 2011-06-29 12:39                 ` Xiao Guangrong
  2011-06-29 13:01                   ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 12:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 08:27 PM, Avi Kivity wrote:
> On 06/29/2011 03:28 PM, Xiao Guangrong wrote:
>> On 06/29/2011 08:18 PM, Avi Kivity wrote:
>> >  On 06/29/2011 02:50 PM, Xiao Guangrong wrote:
>> >>  >>   >
>> >>  >>   >    I think we should do this unconditionally.  The cost of ping-ponging the shared cache line containing reader_counter will increase with large smp counts.  On the other hand, zap_page is very rare, so it can be a little slower.  Also, less code paths = easier to understand.
>> >>  >>   >
>> >>  >>
>> >>  >>   On soft mmu, zap_page is very frequently, it can cause performance regression in my test.
>> >>  >
>> >>  >   Any idea what the cause of the regression is?  It seems to me that simply deferring freeing shouldn't have a large impact.
>> >>  >
>> >>
>> >>  I guess it is because the page is freed too frequently, i have done the test, it shows
>> >>  about 3219 pages is freed per second
>> >>
>> >>  Kernbench performance comparing:
>> >>
>> >>  the origin way: 3m27.723
>> >>  free all shadow page in rcu context: 3m30.519
>> >
>> >  I don't recall seeing such a high free rate.  Who is doing all this zapping?
>> >
>> >  You may be able to find out with the function tracer + call graph.
>> >
>>
>> I looked into it before, it is caused by "write flood" detected, i also noticed
>> some pages are zapped and allocation again and again, maybe we need to improve
>> the algorithm of detecting "write flood".
> 
> Ok.  Let's drop the two paths, and put this improvement on the TODO instead.
> 

Avi, I am sorry, I do not understand it clearly. Does it mean keeping the patch
the original way and doing the improvement after it is merged?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29 12:39                 ` Xiao Guangrong
@ 2011-06-29 13:01                   ` Avi Kivity
  2011-06-29 13:05                     ` Xiao Guangrong
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2011-06-29 13:01 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 03:39 PM, Xiao Guangrong wrote:
> >
> >  Ok.  Let's drop the two paths, and put this improvement on the TODO instead.
> >
>
> Avi, i am sorry, i do not understand it clearly, it means keep the patch as the
> original way and do the improvement after it merged?

I mean, do all freeing using RCU and improve fork detection after 
merge.  I don't like the extra complexity and the extra atomic ops.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table
  2011-06-29 13:01                   ` Avi Kivity
@ 2011-06-29 13:05                     ` Xiao Guangrong
  0 siblings, 0 replies; 62+ messages in thread
From: Xiao Guangrong @ 2011-06-29 13:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, LKML, KVM

On 06/29/2011 09:01 PM, Avi Kivity wrote:
> On 06/29/2011 03:39 PM, Xiao Guangrong wrote:
>> >
>> >  Ok.  Let's drop the two paths, and put this improvement on the TODO instead.
>> >
>>
>> Avi, i am sorry, i do not understand it clearly, it means keep the patch as the
>> original way and do the improvement after it merged?
> 
> I mean, do all freeing using RCU and improve fork detection after merge.  I don't like the extra complexity and the extra atomic ops.
> 

Oh, I see, thanks! :-)

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2011-06-29 13:03 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-22 14:27 [PATCH v2 0/22] KVM: optimize for MMIO handled Xiao Guangrong
2011-06-22 14:28 ` [PATCH v2 01/22] KVM: MMU: fix walking shadow page table Xiao Guangrong
2011-06-22 17:13   ` Marcelo Tosatti
2011-06-23  2:05     ` Xiao Guangrong
2011-06-27  6:35     ` Xiao Guangrong
2011-06-22 14:28 ` [PATCH v2 02/22] KVM: MMU: do not update slot bitmap if spte is nonpresent Xiao Guangrong
2011-06-22 14:29 ` [PATCH v2 03/22] KVM: x86: fix broken read emulation spans a page boundary Xiao Guangrong
2011-06-29  8:21   ` Avi Kivity
2011-06-29 10:53     ` Xiao Guangrong
2011-06-29 11:19       ` Avi Kivity
2011-06-22 14:29 ` [PATCH v2 04/22] KVM: x86: introduce vcpu_gva_to_gpa to cleanup the code Xiao Guangrong
2011-06-29  8:24   ` Avi Kivity
2011-06-29 10:56     ` Xiao Guangrong
2011-06-29 11:09       ` Avi Kivity
2011-06-29 11:26         ` Xiao Guangrong
2011-06-29 11:26           ` Avi Kivity
2011-06-29 11:48             ` Gleb Natapov
2011-06-22 14:30 ` [PATCH v2 05/22] KVM: x86: abstract the operation for read/write emulation Xiao Guangrong
2011-06-29  8:37   ` Avi Kivity
2011-06-29 10:59     ` Xiao Guangrong
2011-06-22 14:30 ` [PATCH v2 06/22] KVM: x86: cleanup the code of " Xiao Guangrong
2011-06-22 14:31 ` [PATCH v2 07/22] KVM: MMU: cache mmio info on page fault path Xiao Guangrong
2011-06-29  8:48   ` Avi Kivity
2011-06-29 11:09     ` Xiao Guangrong
2011-06-29 11:10       ` Avi Kivity
2011-06-22 14:31 ` [PATCH v2 08/22] KVM: MMU: optimize to handle dirty bit Xiao Guangrong
2011-06-22 14:31 ` [PATCH v2 09/22] KVM: MMU: cleanup for FNAME(fetch) Xiao Guangrong
2011-06-22 14:32 ` [PATCH v2 10/22] KVM: MMU: rename 'pt_write' to 'emulate' Xiao Guangrong
2011-06-22 14:32 ` [PATCH v2 11/22] KVM: MMU: count used shadow pages on prepareing path Xiao Guangrong
2011-06-22 14:32 ` [PATCH v2 12/22] KVM: MMU: split kvm_mmu_free_page Xiao Guangrong
2011-06-22 14:33 ` [PATCH v2 13/22] KVM: MMU: remove bypass_guest_pf Xiao Guangrong
2011-06-22 14:33 ` [PATCH v2 14/22] KVM: MMU: filter out the mmio pfn from the fault pfn Xiao Guangrong
2011-06-22 14:34 ` [PATCH v2 15/22] KVM: MMU: abstract some functions to handle " Xiao Guangrong
2011-06-22 14:34 ` [PATCH v2 16/22] KVM: MMU: introduce the rules to modify shadow page table Xiao Guangrong
2011-06-22 14:34 ` [PATCH v2 17/22] KVM: MMU: clean up spte updating and clearing Xiao Guangrong
2011-06-22 14:35 ` [PATCH 18/22] KVM: MMU: do not need atomicly to set/clear spte Xiao Guangrong
2011-06-22 14:35 ` [PATCH v2 19/22] KVM: MMU: lockless walking shadow page table Xiao Guangrong
2011-06-29  9:16   ` Avi Kivity
2011-06-29 11:16     ` Xiao Guangrong
2011-06-29 11:18       ` Avi Kivity
2011-06-29 11:50         ` Xiao Guangrong
2011-06-29 12:18           ` Avi Kivity
2011-06-29 12:28             ` Xiao Guangrong
2011-06-29 12:27               ` Avi Kivity
2011-06-29 12:39                 ` Xiao Guangrong
2011-06-29 13:01                   ` Avi Kivity
2011-06-29 13:05                     ` Xiao Guangrong
2011-06-22 14:35 ` [PATCH v2 20/22] KVM: MMU: reorganize struct kvm_shadow_walk_iterator Xiao Guangrong
2011-06-22 14:36 ` [PATCH v2 21/22] KVM: MMU: mmio page fault support Xiao Guangrong
2011-06-22 21:59   ` Marcelo Tosatti
2011-06-23  3:19     ` Xiao Guangrong
2011-06-23  6:40       ` Xiao Guangrong
2011-06-23 14:21       ` Marcelo Tosatti
2011-06-23 17:55         ` Xiao Guangrong
2011-06-23 20:13           ` Marcelo Tosatti
2011-06-24  2:04             ` Xiao Guangrong
2011-06-26  8:42           ` Avi Kivity
2011-06-27 11:00   ` [PATCH v3 " Xiao Guangrong
2011-06-29  9:22   ` [PATCH v2 " Avi Kivity
2011-06-29 12:28     ` Xiao Guangrong
2011-06-22 14:36 ` [PATCH v2 22/22] KVM: MMU: trace mmio page fault Xiao Guangrong
2011-06-29  9:23 ` [PATCH v2 0/22] KVM: optimize for MMIO handled Avi Kivity
