summaryrefslogtreecommitdiffstats
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-04-29KVM: VMX: Enable machine check support for 32bit targetsUros Bizjak
commit fb56baae5ea509e63c2a068d66a4d8ea91969fca upstream. There is no reason to limit the use of do_machine_check to 64bit targets. MCE handling works for both target familes. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Fixes: a0861c02a981 ("KVM: Add VT-x machine check support") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200414071414.45636-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-29x86/KVM: Clean up host's steal time structureBoris Ostrovsky
commit a6bd811f1209fe1c64c9f6fd578101d6436c6b6e upstream. Now that we are mapping kvm_steal_time from the guest directly we don't need keep a copy of it in kvm_vcpu_arch.st. The same is true for the stime field. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missedBoris Ostrovsky
commit b043138246a41064527cf019a3d51d9f015e9796 upstream. There is a potential race in record_steal_time() between setting host-local vcpu->arch.st.steal.preempted to zero (i.e. clearing KVM_VCPU_PREEMPTED) and propagating this value to the guest with kvm_write_guest_cached(). Between those two events the guest may still see KVM_VCPU_PREEMPTED in its copy of kvm_steal_time, set KVM_VCPU_FLUSH_TLB and assume that hypervisor will do the right thing. Which it won't. Instad of copying, we should map kvm_steal_time and that will guarantee atomicity of accesses to @preempted. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 4.19: No tracepoint in record_steal_time().] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29x86/kvm: Cache gfn to pfn translationBoris Ostrovsky
commit 917248144db5d7320655dbb41d3af0b8a0f3d589 upstream. __kvm_map_gfn()'s call to gfn_to_pfn_memslot() is * relatively expensive * in certain cases (such as when done from atomic context) cannot be called Stashing gfn-to-pfn mapping should help with both cases. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29KVM: nVMX: Always sync GUEST_BNDCFGS when it comes from vmcs01Sean Christopherson
commit 3b013a2972d5bc344d6eaa8f24fdfe268211e45f upstream. If L1 does not set VM_ENTRY_LOAD_BNDCFGS, then L1's BNDCFGS value must be propagated to vmcs02 since KVM always runs with VM_ENTRY_LOAD_BNDCFGS when MPX is supported. Because the value effectively comes from vmcs01, vmcs02 must be updated even if vmcs12 is clean. Fixes: 62cf9bd8118c4 ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS") Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 4.19: adjust filename, context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29KVM: VMX: Zero out *all* general purpose registers after VM-ExitSean Christopherson
commit 0e0ab73c9a0243736bcd779b30b717e23ba9a56d upstream. ...except RSP, which is restored by hardware as part of VM-Exit. Paolo theorized that restoring registers from the stack after a VM-Exit in lieu of zeroing them could lead to speculative execution with the guest's values, e.g. if the stack accesses miss the L1 cache[1]. Zeroing XORs are dirt cheap, so just be ultra-paranoid. Note that the scratch register (currently RCX) used to save/restore the guest state is also zeroed as its host-defined value is loaded via the stack, just with a MOV instead of a POP. [1] https://patchwork.kernel.org/patch/10771539/#22441255 Fixes: 0cb5b30698fd ("kvm: vmx: Scrub hardware GPRs at VM-exit") Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 4.19: adjust filename, context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23x86: ACPI: fix CPU hotplug deadlockQian Cai
[ Upstream commit 696ac2e3bf267f5a2b2ed7d34e64131f2287d0ad ] Similar to commit 0266d81e9bf5 ("acpi/processor: Prevent cpu hotplug deadlock") except this is for acpi_processor_ffh_cstate_probe(): "The problem is that the work is scheduled on the current CPU from the hotplug thread associated with that CPU. It's not required to invoke these functions via the workqueue because the hotplug thread runs on the target CPU already. Check whether current is a per cpu thread pinned on the target CPU and invoke the function directly to avoid the workqueue." WARNING: possible circular locking dependency detected ------------------------------------------------------ cpuhp/1/15 is trying to acquire lock: ffffc90003447a28 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x4c6/0x630 but task is already holding lock: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x3e/0xc0 irq_calc_affinity_vectors+0x5f/0x91 __pci_enable_msix_range+0x10f/0x9a0 pci_alloc_irq_vectors_affinity+0x13e/0x1f0 pci_alloc_irq_vectors_affinity at drivers/pci/msi.c:1208 pqi_ctrl_init+0x72f/0x1618 [smartpqi] pqi_pci_probe.cold.63+0x882/0x892 [smartpqi] local_pci_probe+0x7a/0xc0 work_for_cpu_fn+0x2e/0x50 process_one_work+0x57e/0xb90 worker_thread+0x363/0x5b0 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __lock_acquire+0x2244/0x32a0 lock_acquire+0x1a2/0x680 __flush_work+0x4e6/0x630 work_on_cpu+0x114/0x160 acpi_processor_ffh_cstate_probe+0x129/0x250 acpi_processor_evaluate_cst+0x4c8/0x580 acpi_processor_get_power_info+0x86/0x740 acpi_processor_hotplug+0xc3/0x140 acpi_soft_cpu_online+0x102/0x1d0 cpuhp_invoke_callback+0x197/0x1120 cpuhp_thread_fun+0x252/0x2f0 smpboot_thread_fn+0x255/0x440 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 other info that might help us debug this: Chain exists of: (work_completion)(&wfc.work) --> cpuhp_state-up --> cpuidle_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpuidle_lock); lock(cpuhp_state-up); lock(cpuidle_lock); lock((work_completion)(&wfc.work)); *** DEADLOCK *** 3 locks held by cpuhp/1/15: #0: ffffffffaf51ab10 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0 #1: ffffffffaf51ad40 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0 #2: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20 Call Trace: dump_stack+0xa0/0xea print_circular_bug.cold.52+0x147/0x14c check_noncircular+0x295/0x2d0 __lock_acquire+0x2244/0x32a0 lock_acquire+0x1a2/0x680 __flush_work+0x4e6/0x630 work_on_cpu+0x114/0x160 acpi_processor_ffh_cstate_probe+0x129/0x250 acpi_processor_evaluate_cst+0x4c8/0x580 acpi_processor_get_power_info+0x86/0x740 acpi_processor_hotplug+0xc3/0x140 acpi_soft_cpu_online+0x102/0x1d0 cpuhp_invoke_callback+0x197/0x1120 cpuhp_thread_fun+0x252/0x2f0 smpboot_thread_fn+0x255/0x440 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 Signed-off-by: Qian Cai <cai@lca.pw> Tested-by: Borislav Petkov <bp@suse.de> [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23x86/Hyper-V: Report crash data in die() when panic_on_oops is setTianyu Lan
[ Upstream commit f3a99e761efa616028b255b4de58e9b5b87c5545 ] When oops happens with panic_on_oops unset, the oops thread is killed by die() and system continues to run. In such case, guest should not report crash register data to host since system still runs. Check panic_on_oops and return directly in hyperv_report_panic() when the function is called in the die() and panic_on_oops is unset. Fix it. Fixes: 7ed4325a44ea ("Drivers: hv: vmbus: Make panic reporting to be more useful") Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20200406155331.2105-7-Tianyu.Lan@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23x86/Hyper-V: Report crash register data or kmsg before running crash kernelTianyu Lan
commit a11589563e96bf262767294b89b25a9d44e7303b upstream. We want to notify Hyper-V when a Linux guest VM crash occurs, so there is a record of the crash even when kdump is enabled. But crash_kexec_post_notifiers defaults to "false", so the kdump kernel runs before the notifiers and Hyper-V never gets notified. Fix this by always setting crash_kexec_post_notifiers to be true for Hyper-V VMs. Fixes: 81b18bce48af ("Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic") Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Link: https://lore.kernel.org/r/20200406155331.2105-5-Tianyu.Lan@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21x86/resctrl: Fix invalid attempt at removing the default resource groupReinette Chatre
commit b0151da52a6d4f3951ea24c083e7a95977621436 upstream. The default resource group ("rdtgroup_default") is associated with the root of the resctrl filesystem and should never be removed. New resource groups can be created as subdirectories of the resctrl filesystem and they can be removed from user space. There exists a safeguard in the directory removal code (rdtgroup_rmdir()) that ensures that only subdirectories can be removed by testing that the directory to be removed has to be a child of the root directory. A possible deadlock was recently fixed with 334b0f4e9b1b ("x86/resctrl: Fix a deadlock due to inaccurate reference"). This fix involved associating the private data of the "mon_groups" and "mon_data" directories to the resource group to which they belong instead of NULL as before. A consequence of this change was that the original safeguard code preventing removal of "mon_groups" and "mon_data" found in the root directory failed resulting in attempts to remove the default resource group that ends in a BUG: kernel BUG at mm/slub.c:3969! invalid opcode: 0000 [#1] SMP PTI Call Trace: rdtgroup_rmdir+0x16b/0x2c0 kernfs_iop_rmdir+0x5c/0x90 vfs_rmdir+0x7a/0x160 do_rmdir+0x17d/0x1e0 do_syscall_64+0x55/0x1d0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by improving the directory removal safeguard to ensure that subdirectories of the resctrl root directory can only be removed if they are a child of the resctrl filesystem's root _and_ not associated with the default resource group. Fixes: 334b0f4e9b1b ("x86/resctrl: Fix a deadlock due to inaccurate reference") Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/884cbe1773496b5dbec1b6bd11bb50cffa83603d.1584461853.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21x86/resctrl: Preserve CDP enable over CPU hotplugJames Morse
commit 9fe0450785abbc04b0ed5d3cf61fcdb8ab656b4b upstream. Resctrl assumes that all CPUs are online when the filesystem is mounted, and that CPUs remember their CDP-enabled state over CPU hotplug. This goes wrong when resctrl's CDP-enabled state changes while all the CPUs in a domain are offline. When a domain comes online, enable (or disable!) CDP to match resctrl's current setting. Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support") Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200221162105.154163-1-james.morse@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21x86/microcode/AMD: Increase microcode PATCH_MAX_SIZEJohn Allen
commit bdf89df3c54518eed879d8fac7577fcfb220c67e upstream. Future AMD CPUs will have microcode patches that exceed the default 4K patch size. Raise our limit. Signed-off-by: John Allen <john.allen@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # v4.14.. Link: https://lkml.kernel.org/r/20200409152931.GA685273@mojo.amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21kvm: x86: Host feature SSBD doesn't imply guest feature SPEC_CTRL_SSBDJim Mattson
commit 396d2e878f92ec108e4293f1c77ea3bc90b414ff upstream. The host reports support for the synthetic feature X86_FEATURE_SSBD when any of the three following hardware features are set: CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31] CPUID.80000008H:EBX.AMD_SSBD[bit 24] CPUID.80000008H:EBX.VIRT_SSBD[bit 25] Either of the first two hardware features implies the existence of the IA32_SPEC_CTRL MSR, but CPUID.80000008H:EBX.VIRT_SSBD[bit 25] does not. Therefore, CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31] should only be set in the guest if CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31] or CPUID.80000008H:EBX.AMD_SSBD[bit 24] is set on the host. Fixes: 0c54914d0c52a ("KVM: x86: use Intel speculation bugs and features as derived in generic x86 code") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jacob Xu <jacobhxu@google.com> Reviewed-by: Peter Shier <pshier@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reported-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 4.x: adjust indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17efi/x86: Fix the deletion of variables in mixed modeGary Lin
[ Upstream commit a4b81ccfd4caba017d2b84720b6de4edd16911a0 ] efi_thunk_set_variable() treated the NULL "data" pointer as an invalid parameter, and this broke the deletion of variables in mixed mode. This commit fixes the check of data so that the userspace program can delete a variable in mixed mode. Fixes: 8319e9d5ad98ffcc ("efi/x86: Handle by-ref arguments covering multiple pages in mixed mode") Signed-off-by: Gary Lin <glin@suse.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200408081606.1504-1-glin@suse.com Link: https://lore.kernel.org/r/20200409130434.6736-9-ardb@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17KVM: VMX: fix crash cleanup when KVM wasn't usedVitaly Kuznetsov
commit dbef2808af6c594922fe32833b30f55f35e9da6d upstream. If KVM wasn't used at all before we crash the cleanup procedure fails with BUG: unable to handle page fault for address: ffffffffffffffc8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 23215067 P4D 23215067 PUD 23217067 PMD 0 Oops: 0000 [#8] SMP PTI CPU: 0 PID: 3542 Comm: bash Kdump: loaded Tainted: G D 5.6.0-rc2+ #823 RIP: 0010:crash_vmclear_local_loaded_vmcss.cold+0x19/0x51 [kvm_intel] The root cause is that loaded_vmcss_on_cpu list is not yet initialized, we initialize it in hardware_enable() but this only happens when we start a VM. Previously, we used to have a bitmap with enabled CPUs and that was preventing [masking] the issue. Initialized loaded_vmcss_on_cpu list earlier, right before we assign crash_vmclear_loaded_vmcss pointer. blocked_vcpu_on_cpu list and blocked_vcpu_on_cpu_lock are moved altogether for consistency. Fixes: 31603d4fc2bb ("KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200401081348.1345307-1-vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: x86: Gracefully handle __vmalloc() failure during VM allocationSean Christopherson
commit d18b2f43b9147c8005ae0844fb445d8cc6a87e31 upstream. Check the result of __vmalloc() to avoid dereferencing a NULL pointer in the event that allocation failres. Fixes: d1e5b0e98ea27 ("kvm: Make VM ioctl do valloc for some archs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec supportSean Christopherson
commit 31603d4fc2bb4f0815245d496cb970b27b4f636a upstream. VMCLEAR all in-use VMCSes during a crash, even if kdump's NMI shootdown interrupted a KVM update of the percpu in-use VMCS list. Because NMIs are not blocked by disabling IRQs, it's possible that crash_vmclear_local_loaded_vmcss() could be called while the percpu list of VMCSes is being modified, e.g. in the middle of list_add() in vmx_vcpu_load_vmcs(). This potential corner case was called out in the original commit[*], but the analysis of its impact was wrong. Skipping the VMCLEARs is wrong because it all but guarantees that a loaded, and therefore cached, VMCS will live across kexec and corrupt memory in the new kernel. Corruption will occur because the CPU's VMCS cache is non-coherent, i.e. not snooped, and so the writeback of VMCS memory on its eviction will overwrite random memory in the new kernel. The VMCS will live because the NMI shootdown also disables VMX, i.e. the in-progress VMCLEAR will #UD, and existing Intel CPUs do not flush the VMCS cache on VMXOFF. Furthermore, interrupting list_add() and list_del() is safe due to crash_vmclear_local_loaded_vmcss() using forward iteration. list_add() ensures the new entry is not visible to forward iteration unless the entire add completes, via WRITE_ONCE(prev->next, new). A bad "prev" pointer could be observed if the NMI shootdown interrupted list_del() or list_add(), but list_for_each_entry() does not consume ->prev. In addition to removing the temporary disabling of VMCLEAR, open code loaded_vmcs_init() in __loaded_vmcs_clear() and reorder VMCLEAR so that the VMCS is deleted from the list only after it's been VMCLEAR'd. Deleting the VMCS before VMCLEAR would allow a race where the NMI shootdown could arrive between list_del() and vmcs_clear() and thus neither flow would execute a successful VMCLEAR. Alternatively, more code could be moved into loaded_vmcs_init(), but that gets rather silly as the only other user, alloc_loaded_vmcs(), doesn't need the smp_wmb() and would need to work around the list_del(). Update the smp_*() comments related to the list manipulation, and opportunistically reword them to improve clarity. [*] https://patchwork.kernel.org/patch/1675731/#3720461 Fixes: 8f536b7697a0 ("KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200321193751.24985-2-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: x86: Allocate new rmap and large page tracking when moving memslotSean Christopherson
commit edd4fa37baa6ee8e44dc65523b27bd6fe44c94de upstream. Reallocate a rmap array and recalcuate large page compatibility when moving an existing memslot to correctly handle the alignment properties of the new memslot. The number of rmap entries required at each level is dependent on the alignment of the memslot's base gfn with respect to that level, e.g. moving a large-page aligned memslot so that it becomes unaligned will increase the number of rmap entries needed at the now unaligned level. Not updating the rmap array is the most obvious bug, as KVM accesses garbage data beyond the end of the rmap. KVM interprets the bad data as pointers, leading to non-canonical #GPs, unexpected #PFs, etc... general protection fault: 0000 [#1] SMP CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:rmap_get_first+0x37/0x50 [kvm] Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3 RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246 RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012 RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570 RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8 R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000 R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8 FS: 00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0 Call Trace: kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm] __kvm_set_memory_region.part.64+0x559/0x960 [kvm] kvm_set_memory_region+0x45/0x60 [kvm] kvm_vm_ioctl+0x30f/0x920 [kvm] do_vfs_ioctl+0xa1/0x620 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4c/0x170 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f7ed9911f47 Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47 RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004 RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700 R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000 R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000 Modules linked in: kvm_intel kvm irqbypass ---[ end trace 0c5f570b3358ca89 ]--- The disallow_lpage tracking is more subtle. Failure to update results in KVM creating large pages when it shouldn't, either due to stale data or again due to indexing beyond the end of the metadata arrays, which can lead to memory corruption and/or leaking data to guest/userspace. Note, the arrays for the old memslot are freed by the unconditional call to kvm_free_memslot() in __kvm_set_memory_region(). Fixes: 05da45583de9b ("KVM: MMU: large page support") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: nVMX: Properly handle userspace interrupt window requestSean Christopherson
commit a1c77abb8d93381e25a8d2df3a917388244ba776 upstream. Return true for vmx_interrupt_allowed() if the vCPU is in L2 and L1 has external interrupt exiting enabled. IRQs are never blocked in hardware if the CPU is in the guest (L2 from L1's perspective) when IRQs trigger VM-Exit. The new check percolates up to kvm_vcpu_ready_for_interrupt_injection() and thus vcpu_run(), and so KVM will exit to userspace if userspace has requested an interrupt window (to inject an IRQ into L1). Remove the @external_intr param from vmx_check_nested_events(), which is actually an indicator that userspace wants an interrupt window, e.g. it's named @req_int_win further up the stack. Injecting a VM-Exit into L1 to try and bounce out to L0 userspace is all kinds of broken and is no longer necessary. Remove the hack in nested_vmx_vmexit() that attempted to workaround the breakage in vmx_check_nested_events() by only filling interrupt info if there's an actual interrupt pending. The hack actually made things worse because it caused KVM to _never_ fill interrupt info when the LAPIC resides in userspace (kvm_cpu_has_interrupt() queries interrupt.injected, which is always cleared by prepare_vmcs12() before reaching the hack in nested_vmx_vmexit()). Fixes: 6550c4df7e50 ("KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"") Cc: stable@vger.kernel.org Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17x86/entry/32: Add missing ASM_CLAC to general_protection entryThomas Gleixner
commit 3d51507f29f2153a658df4a0674ec5b592b62085 upstream. All exception entry points must have ASM_CLAC right at the beginning. The general_protection entry is missing one. Fixes: e59d1b0a2419 ("x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200225220216.219537887@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17acpi/x86: ignore unspecified bit positions in the ACPI global lock fieldJan Engelhardt
commit ecb9c790999fd6c5af0f44783bd0217f0b89ec2b upstream. The value in "new" is constructed from "old" such that all bits defined as reserved by the ACPI spec[1] are left untouched. But if those bits do not happen to be all zero, "new < 3" will not evaluate to true. The firmware of the laptop(s) Medion MD63490 / Akoya P15648 comes with garbage inside the "FACS" ACPI table. The starting value is old=0x4944454d, therefore new=0x4944454e, which is >= 3. Mask off the reserved bits. [1] https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf Link: https://bugzilla.kernel.org/show_bug.cgi?id=206553 Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17x86/boot: Use unsigned comparison for addressesArvind Sankar
[ Upstream commit 81a34892c2c7c809f9c4e22c5ac936ae673fb9a2 ] The load address is compared with LOAD_PHYSICAL_ADDR using a signed comparison currently (using jge instruction). When loading a 64-bit kernel using the new efi32_pe_entry() point added by: 97aa276579b2 ("efi/x86: Add true mixed mode entry point into .compat section") using Qemu with -m 3072, the firmware actually loads us above 2Gb, resulting in a very early crash. Use the JAE instruction to perform a unsigned comparison instead, as physical addresses should be considered unsigned. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200301230436.2246909-6-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-14-ardb@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17x86: Don't let pgprot_modify() change the page encryption bitThomas Hellstrom
[ Upstream commit 6db73f17c5f155dbcfd5e48e621c706270b84df0 ] When SEV or SME is enabled and active, vm_get_page_prot() typically returns with the encryption bit set. This means that users of pgprot_modify(, vm_get_page_prot()) (mprotect_fixup(), do_mmap()) end up with a value of vma->vm_pg_prot that is not consistent with the intended protection of the PTEs. This is also important for fault handlers that rely on the VMA vm_page_prot to set the page protection. Fix this by not allowing pgprot_modify() to change the encryption bit, similar to how it's done for PAT bits. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20200304114527.3636-2-thomas_os@shipmail.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-02ftrace/x86: Anotate text_mutex split between ↵Jiri Kosina
ftrace_arch_code_modify_post_process() and ftrace_arch_code_modify_prepare() commit 074376ac0e1d1fcd4fafebca86ee6158e7c20680 upstream. ftrace_arch_code_modify_prepare() is acquiring text_mutex, while the corresponding release is happening in ftrace_arch_code_modify_post_process(). This has already been documented in the code, but let's also make the fact that this is intentional clear to the semantic analysis tools such as sparse. Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1906292321170.27227@cbobk.fhfr.pm Fixes: 39611265edc1a ("ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare()") Fixes: d5b844a2cf507 ("ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()") Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25x86/mm: split vmalloc_sync_all()Joerg Roedel
commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream. Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-20perf/amd/uncore: Replace manual sampling check with CAP_NO_INTERRUPT flagKim Phillips
[ Upstream commit f967140dfb7442e2db0868b03b961f9c59418a1b ] Enable the sampling check in kernel/events/core.c::perf_event_open(), which returns the more appropriate -EOPNOTSUPP. BEFORE: $ sudo perf record -a -e instructions,l3_request_g1.caching_l3_cache_accesses true Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (l3_request_g1.caching_l3_cache_accesses). /bin/dmesg | grep -i perf may provide additional information. With nothing relevant in dmesg. AFTER: $ sudo perf record -a -e instructions,l3_request_g1.caching_l3_cache_accesses true Error: l3_request_g1.caching_l3_cache_accesses: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat' Fixes: c43ca5091a37 ("perf/x86/amd: Add support for AMD NB and L2I "uncore" counters") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200311191323.13124-1-kim.phillips@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-18x86/mce: Fix logic and comments around MSR_PPIN_CTLTony Luck
commit 59b5809655bdafb0767d3fd00a3e41711aab07e6 upstream. There are two implemented bits in the PPIN_CTL MSR: Bit 0: LockOut (R/WO) Set 1 to prevent further writes to MSR_PPIN_CTL. Bit 1: Enable_PPIN (R/W) If 1, enables MSR_PPIN to be accessible using RDMSR. If 0, an attempt to read MSR_PPIN will cause #GP. So there are four defined values: 0: PPIN is disabled, PPIN_CTL may be updated 1: PPIN is disabled. PPIN_CTL is locked against updates 2: PPIN is enabled. PPIN_CTL may be updated 3: PPIN is enabled. PPIN_CTL is locked against updates Code would only enable the X86_FEATURE_INTEL_PPIN feature for case "2". When it should have done so for both case "2" and case "3". Fix the final test to just check for the enable bit. Also fix some of the other comments in this function. Fixes: 3f5a7896a509 ("x86/mce: Include the PPIN in MCE records when available") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200226011737.9958-1-tony.luck@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-18KVM: x86: clear stale x86_emulate_ctxt->intercept valueVitaly Kuznetsov
commit 342993f96ab24d5864ab1216f46c0b199c2baf8e upstream. After commit 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode") Hyper-V guests on KVM stopped booting with: kvm_nested_vmexit: rip fffff802987d6169 reason EPT_VIOLATION info1 181 info2 0 int_info 0 int_info_err 0 kvm_page_fault: address febd0000 error_code 181 kvm_emulate_insn: 0:fffff802987d6169: f3 a5 kvm_emulate_insn: 0:fffff802987d6169: f3 a5 FAIL kvm_inj_exception: #UD (0x0) "f3 a5" is a "rep movsw" instruction, which should not be intercepted at all. Commit c44b4c6ab80e ("KVM: emulate: clean up initializations in init_decode_cache") reduced the number of fields cleared by init_decode_cache() claiming that they are being cleared elsewhere, 'intercept', however, is left uncleared if the instruction does not have any of the "slow path" flags (NotImpl, Stack, Op3264, Sse, Mmx, CheckPerm, NearBranch, No16 and of course Intercept itself). Fixes: c44b4c6ab80e ("KVM: emulate: clean up initializations in init_decode_cache") Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode") Cc: stable@vger.kernel.org Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-16KVM: SVM: fix up incorrect backportGreg Kroah-Hartman
When I backported 52918ed5fcf0 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled") to 4.19 (which resulted in commit a4e761c9f63a ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")), I messed up the call to kvm_mmu_set_mmio_spte_mask() Fix that here now. Reported-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11efi/x86: Handle by-ref arguments covering multiple pages in mixed modeArd Biesheuvel
commit 8319e9d5ad98ffccd19f35664382c73cea216193 upstream. The mixed mode runtime wrappers are fragile when it comes to how the memory referred to by its pointer arguments are laid out in memory, due to the fact that it translates these addresses to physical addresses that the runtime services can dereference when running in 1:1 mode. Since vmalloc'ed pages (including the vmap'ed stack) are not contiguous in the physical address space, this scheme only works if the referenced memory objects do not cross page boundaries. Currently, the mixed mode runtime service wrappers require that all by-ref arguments that live in the vmalloc space have a size that is a power of 2, and are aligned to that same value. While this is a sensible way to construct an object that is guaranteed not to cross a page boundary, it is overly strict when it comes to checking whether a given object violates this requirement, as we can simply take the physical address of the first and the last byte, and verify that they point into the same physical page. When this check fails, we emit a WARN(), but then simply proceed with the call, which could cause data corruption if the next physical page belongs to a mapping that is entirely unrelated. Given that with vmap'ed stacks, this condition is much more likely to trigger, let's relax the condition a bit, but fail the runtime service call if it does trigger. Fixes: f6697df36bdf0bf7 ("x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-efi@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200221084849.26878-4-ardb@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11efi/x86: Align GUIDs to their size in the mixed mode runtime wrapperArd Biesheuvel
commit 63056e8b5ebf41d52170e9f5ba1fc83d1855278c upstream. Hans reports that his mixed mode systems running v5.6-rc1 kernels hit the WARN_ON() in virt_to_phys_or_null_size(), caused by the fact that efi_guid_t objects on the vmap'ed stack happen to be misaligned with respect to their sizes. As a quick (i.e., backportable) fix, copy GUID pointer arguments to the local stack into a buffer that is naturally aligned to its size, so that it is guaranteed to cover only one physical page. Note that on x86, we cannot rely on the stack pointer being aligned the way the compiler expects, so we need to allocate an 8-byte aligned buffer of sufficient size, and copy the GUID into that buffer at an offset that is aligned to 16 bytes. Fixes: f6697df36bdf0bf7 ("x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y") Reported-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Hans de Goede <hdegoede@redhat.com> Cc: linux-efi@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200221084849.26878-2-ardb@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11x86/pkeys: Manually set X86_FEATURE_OSPKE to preserve existing changesSean Christopherson
commit 735a6dd02222d8d070c7bb748f25895239ca8c92 upstream. Explicitly set X86_FEATURE_OSPKE via set_cpu_cap() instead of calling get_cpu_cap() to pull the feature bit from CPUID after enabling CR4.PKE. Invoking get_cpu_cap() effectively wipes out any {set,clear}_cpu_cap() changes that were made between this_cpu->c_init() and setup_pku(), as all non-synthetic feature words are reinitialized from the CPU's CPUID values. Blasting away capability updates manifests most visibility when running on a VMX capable CPU, but with VMX disabled by BIOS. To indicate that VMX is disabled, init_ia32_feat_ctl() clears X86_FEATURE_VMX, using clear_cpu_cap() instead of setup_clear_cpu_cap() so that KVM can report which CPU is misconfigured (KVM needs to probe every CPU anyways). Restoring X86_FEATURE_VMX from CPUID causes KVM to think VMX is enabled, ultimately leading to an unexpected #GP when KVM attempts to do VMXON. Arguably, init_ia32_feat_ctl() should use setup_clear_cpu_cap() and let KVM figure out a different way to report the misconfigured CPU, but VMX is not the only feature bit that is affected, i.e. there is precedent that tweaking feature bits via {set,clear}_cpu_cap() after ->c_init() is expected to work. Most notably, x86_init_rdrand()'s clearing of X86_FEATURE_RDRAND when RDRAND malfunctions is also overwritten. Fixes: 0697694564c8 ("x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU") Reported-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Jacob Keller <jacob.e.keller@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200226231615.13664-1-sean.j.christopherson@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11x86/xen: Distribute switch variables for initializationKees Cook
[ Upstream commit 9038ec99ceb94fb8d93ade5e236b2928f0792c7c ] Variables declared in a switch statement before any case statements cannot be automatically initialized with compiler instrumentation (as they are not part of any execution flow). With GCC's proposed automatic stack variable initialization feature, this triggers a warning (and they don't get initialized). Clang's automatic stack variable initialization (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't initialize such variables[1]. Note that these warnings (or silent skipping) happen before the dead-store elimination optimization phase, so even when the automatic initializations are later elided in favor of direct initializations, the warnings remain. To avoid these problems, move such variables into the "case" where they're used or lift them up into the main function body. arch/x86/xen/enlighten_pv.c: In function ‘xen_write_msr_safe’: arch/x86/xen/enlighten_pv.c:904:12: warning: statement will never be executed [-Wswitch-unreachable] 904 | unsigned which; | ^~~~~ [1] https://bugs.llvm.org/show_bug.cgi?id=44916 Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200220062318.69299-1-keescook@chromium.org Reviewed-by: Juergen Gross <jgross@suse.com> [boris: made @which an 'unsigned int'] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-11x86/boot/compressed: Don't declare __force_order in kaslr_64.cH.J. Lu
[ Upstream commit df6d4f9db79c1a5d6f48b59db35ccd1e9ff9adfc ] GCC 10 changed the default to -fno-common, which leads to LD arch/x86/boot/compressed/vmlinux ld: arch/x86/boot/compressed/pgtable_64.o:(.bss+0x0): multiple definition of `__force_order'; \ arch/x86/boot/compressed/kaslr_64.o:(.bss+0x0): first defined here make[2]: *** [arch/x86/boot/compressed/Makefile:119: arch/x86/boot/compressed/vmlinux] Error 1 Since __force_order is already provided in pgtable_64.c, there is no need to declare __force_order in kaslr_64.c. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200124181811.4780-1-hjl.tools@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-05KVM: x86: Remove spurious clearing of async #PF MSRSean Christopherson
commit 208050dac5ef4de5cb83ffcafa78499c94d0b5ad upstream. Remove a bogus clearing of apf.msr_val from kvm_arch_vcpu_destroy(). apf.msr_val is only set to a non-zero value by kvm_pv_enable_async_pf(), which is only reachable by kvm_set_msr_common(), i.e. by writing MSR_KVM_ASYNC_PF_EN. KVM does not autonomously write said MSR, i.e. can only be written via KVM_SET_MSRS or KVM_RUN. Since KVM_SET_MSRS and KVM_RUN are vcpu ioctls, they require a valid vcpu file descriptor. kvm_arch_vcpu_destroy() is only called if KVM_CREATE_VCPU fails, and KVM declares KVM_CREATE_VCPU successful once the vcpu fd is installed and thus visible to userspace. Ergo, apf.msr_val cannot be non-zero when kvm_arch_vcpu_destroy() is called. Fixes: 344d9588a9df0 ("KVM: Add PV MSR to enable asynchronous page faults delivery.") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05KVM: x86: Remove spurious kvm_mmu_unload() from vcpu destruction pathSean Christopherson
commit 9d979c7e6ff43ca3200ffcb74f57415fd633a2da upstream. x86 does not load its MMU until KVM_RUN, which cannot be invoked until after vCPU creation succeeds. Given that kvm_arch_vcpu_destroy() is called if and only if vCPU creation fails, it is impossible for the MMU to be loaded. Note, the bogus kvm_mmu_unload() call was added during an unrelated refactoring of vCPU allocation, i.e. was presumably added as an opportunstic "fix" for a perceived leak. Fixes: fb3f0f51d92d1 ("KVM: Dynamically allocate vcpus") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05KVM: SVM: Override default MMIO mask if memory encryption is enabledTom Lendacky
commit 52918ed5fcf05d97d257f4131e19479da18f5d16 upstream. The KVM MMIO support uses bit 51 as the reserved bit to cause nested page faults when a guest performs MMIO. The AMD memory encryption support uses a CPUID function to define the encryption bit position. Given this, it is possible that these bits can conflict. Use svm_hardware_setup() to override the MMIO mask if memory encryption support is enabled. Various checks are performed to ensure that the mask is properly defined and rsvd_bits() is used to generate the new mask (as was done prior to the change that necessitated this patch). Fixes: 28a1f3ac1d0c ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs") Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05KVM: VMX: check descriptor table exits on instruction emulationOliver Upton
commit 86f7e90ce840aa1db407d3ea6e9b3a52b2ce923c upstream. KVM emulates UMIP on hardware that doesn't support it by setting the 'descriptor table exiting' VM-execution control and performing instruction emulation. When running nested, this emulation is broken as KVM refuses to emulate L2 instructions by default. Correct this regression by allowing the emulation of descriptor table instructions if L1 hasn't requested 'descriptor table exiting'. Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode") Reported-by: Jan Kiszka <jan.kiszka@web.de> Cc: stable@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: apic: avoid calculating pending eoi from an uninitialized valMiaohe Lin
commit 23520b2def95205f132e167cf5b25c609975e959 upstream. When pv_eoi_get_user() fails, 'val' may remain uninitialized and the return value of pv_eoi_get_pending() becomes random. Fix the issue by initializing the variable. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: handle nested posted interrupts when apicv is disabled for L1Vitaly Kuznetsov
commit 91a5f413af596ad01097e59bf487eb07cb3f1331 upstream. Even when APICv is disabled for L1 it can (and, actually, is) still available for L2, this means we need to always call vmx_deliver_nested_posted_interrupt() when attempting an interrupt delivery. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: Check IO instruction VM-exit conditionsOliver Upton
commit 35a571346a94fb93b5b3b6a599675ef3384bc75c upstream. Consult the 'unconditional IO exiting' and 'use IO bitmaps' VM-execution controls when checking instruction interception. If the 'use IO bitmaps' VM-execution control is 1, check the instruction access against the IO bitmaps to determine if the instruction causes a VM-exit. Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: Refactor IO bitmap checks into helper functionOliver Upton
commit e71237d3ff1abf9f3388337cfebf53b96df2020d upstream. Checks against the IO bitmap are useful for both instruction emulation and VM-exit reflection. Refactor the IO bitmap checks into a helper function. Signed-off-by: Oliver Upton <oupton@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: x86: don't notify userspace IOAPIC on edge-triggered interrupt EOIMiaohe Lin
commit 7455a8327674e1a7c9a1f5dd1b0743ab6713f6d1 upstream. Commit 13db77347db1 ("KVM: x86: don't notify userspace IOAPIC on edge EOI") said, edge-triggered interrupts don't set a bit in TMR, which means that IOAPIC isn't notified on EOI. And var level indicates level-triggered interrupt. But commit 3159d36ad799 ("KVM: x86: use generic function for MSI parsing") replace var level with irq.level by mistake. Fix it by changing irq.level to irq.trig_mode. Cc: stable@vger.kernel.org Fixes: 3159d36ad799 ("KVM: x86: use generic function for MSI parsing") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: Don't emulate instructions in guest modePaolo Bonzini
commit 07721feee46b4b248402133228235318199b05ec upstream. vmx_check_intercept is not yet fully implemented. To avoid emulating instructions disallowed by the L1 hypervisor, refuse to emulate instructions by default. Cc: stable@vger.kernel.org [Made commit, added commit msg - Oliver] Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERFKim Phillips
commit 21b5ee59ef18e27d85810584caf1f7ddc705ea83 upstream. Commit aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter") added support for access to the free-running counter via 'perf -e msr/irperf/', but when exercised, it always returns a 0 count: BEFORE: $ perf stat -e instructions,msr/irperf/ true Performance counter stats for 'true': 624,833 instructions 0 msr/irperf/ Simply set its enable bit - HWCR bit 30 - to make it start counting. Enablement is restricted to all machines advertising IRPERF capability, except those susceptible to an erratum that makes the IRPERF return bad values. That erratum occurs in Family 17h models 00-1fh [1], but not in F17h models 20h and above [2]. AFTER (on a family 17h model 31h machine): $ perf stat -e instructions,msr/irperf/ true Performance counter stats for 'true': 621,690 instructions 622,490 msr/irperf/ [1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors [2] Revision Guide for AMD Family 17h Models 30h-3Fh Processors The revision guides are available from the bugzilla Link below. [ bp: Massage commit message. ] Fixes: aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28x86/mce/amd: Fix kobject lifetimeThomas Gleixner
commit 51dede9c05df2b78acd6dcf6a17d21f0877d2d7b upstream. Accessing the MCA thresholding controls in sysfs concurrently with CPU hotplug can lead to a couple of KASAN-reported issues: BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180 Read of size 8 at addr ffff888367578940 by task grep/4019 and BUG: KASAN: use-after-free in show_error_count+0x15c/0x180 Read of size 2 at addr ffff888368a05514 by task grep/4454 for example. Both result from the fact that the threshold block creation/teardown code frees the descriptor memory itself instead of defining proper ->release function and leaving it to the driver core to take care of that, after all sysfs accesses have completed. Do that and get rid of the custom freeing code, fixing the above UAFs in the process. [ bp: write commit message. ] Fixes: 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28x86/mce/amd: Publish the bank pointer only after setup has succeededBorislav Petkov
commit 6e5cf31fbe651bed7ba1df768f2e123531132417 upstream. threshold_create_bank() creates a bank descriptor per MCA error thresholding counter which can be controlled over sysfs. It publishes the pointer to that bank in a per-CPU variable and then goes on to create additional thresholding blocks if the bank has such. However, that creation of additional blocks in allocate_threshold_blocks() can fail, leading to a use-after-free through the per-CPU pointer. Therefore, publish that pointer only after all blocks have been setup successfully. Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor") Reported-by: Saar Amar <Saar.Amar@microsoft.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-24x86/decoder: Add TEST opcode to Group3-2Masami Hiramatsu
[ Upstream commit 8b7e20a7ba54836076ff35a28349dabea4cec48f ] Add TEST opcode to Group3-2 reg=001b as same as Group3-1 does. Commit 12a78d43de76 ("x86/decoder: Add new TEST instruction pattern") added a TEST opcode assignment to f6 XX/001/XXX (Group 3-1), but did not add f7 XX/001/XXX (Group 3-2). Actually, this TEST opcode variant (ModRM.reg /1) is not described in the Intel SDM Vol2 but in AMD64 Architecture Programmer's Manual Vol.3, Appendix A.2 Table A-6. ModRM.reg Extensions for the Primary Opcode Map. Without this fix, Randy found a warning by insn_decoder_test related to this issue as below. HOSTCC arch/x86/tools/insn_decoder_test HOSTCC arch/x86/tools/insn_sanity TEST posttest arch/x86/tools/insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this. arch/x86/tools/insn_decoder_test: warning: ffffffff81000bf1: f7 0b 00 01 08 00 testl $0x80100,(%rbx) arch/x86/tools/insn_decoder_test: warning: objdump says 6 bytes, but insn_get_length() says 2 arch/x86/tools/insn_decoder_test: warning: Decoded and checked 11913894 instructions with 1 failures TEST posttest arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0x871ce29c) To fix this error, add the TEST opcode according to AMD64 APM Vol.3. [ bp: Massage commit message. ] Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lkml.kernel.org/r/157966631413.9580.10311036595431878351.stgit@devnote2 Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgdArd Biesheuvel
[ Upstream commit 75fbef0a8b6b4bb19b9a91b5214f846c2dc5139e ] The following commit: 15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()") modified kernel_map_pages_in_pgd() to manage writable permissions of memory mappings in the EFI page table in a different way, but in the process, it removed the ability to clear NX attributes from read-only mappings, by clobbering the clear mask if _PAGE_RW is not being requested. Failure to remove the NX attribute from read-only mappings is unlikely to be a security issue, but it does prevent us from tightening the permissions in the EFI page tables going forward, so let's fix it now. Fixes: 15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd() Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200113172245.27925-5-ardb@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24x86/nmi: Remove irq_work from the long duration NMI handlerChangbin Du
[ Upstream commit 248ed51048c40d36728e70914e38bffd7821da57 ] First, printk() is NMI-context safe now since the safe printk() has been implemented and it already has an irq_work to make NMI-context safe. Second, this NMI irq_work actually does not work if a NMI handler causes panic by watchdog timeout. It has no chance to run in such case, while the safe printk() will flush its per-cpu buffers before panicking. While at it, repurpose the irq_work callback into a function which concentrates the NMI duration checking and makes the code easier to follow. [ bp: Massage. ] Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200111125427.15662-1-changbin.du@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>