aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/include/asm
AgeCommit message (Collapse)Author
2022-05-09x86/cpu: Load microcode during restore_processor_state()Borislav Petkov
commit f9e14dbbd454581061c736bf70bf5cbb15ac927c upstream. When resuming from system sleep state, restore_processor_state() restores the boot CPU MSRs. These MSRs could be emulated by microcode. If microcode is not loaded yet, writing to emulated MSRs leads to unchecked MSR access error: ... PM: Calling lapic_suspend+0x0/0x210 unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0...0) at rIP: ... (native_write_msr) Call Trace: <TASK> ? restore_processor_state x86_acpi_suspend_lowlevel acpi_suspend_enter suspend_devices_and_enter pm_suspend.cold state_store kobj_attr_store sysfs_kf_write kernfs_fop_write_iter new_sync_write vfs_write ksys_write __x64_sys_write do_syscall_64 entry_SYSCALL_64_after_hwframe RIP: 0033:0x7fda13c260a7 To ensure microcode emulated MSRs are available for restoration, load the microcode on the boot CPU before restoring these MSRs. [ Pawan: write commit message and productize it. ] Fixes: e2a1256b17b1 ("x86/speculation: Restore speculation related MSRs during S3 resume") Reported-by: Kyle D. Pelton <kyle.d.pelton@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Tested-by: Kyle D. Pelton <kyle.d.pelton@intel.com> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=215841 Link: https://lore.kernel.org/r/4350dfbf785cd482d3fafa72b2b49c83102df3ce.1650386317.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-27stat: fix inconsistency between struct stat and struct compat_statMikulas Patocka
[ Upstream commit 932aba1e169090357a77af18850a10c256b50819 ] struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit st_dev and st_rdev; struct compat_stat (defined in arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by a 16-bit padding. This patch fixes struct compat_stat to match struct stat. [ Historical note: the old x86 'struct stat' did have that 16-bit field that the compat layer had kept around, but it was changes back in 2003 by "struct stat - support larger dev_t": https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d and back in those days, the x86_64 port was still new, and separate from the i386 code, and had already picked up the old version with a 16-bit st_dev field ] Note that we can't change compat_dev_t because it is used by compat_loop_info. Also, if the st_dev and st_rdev values are 32-bit, we don't have to use old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW on filesystems that are on NVMe because NVMe uses the major number 259. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-20KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loadedSean Christopherson
commit 1d0e84806047f38027d7572adb4702ef7c09b317 upstream. Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as -1 is technically undefined behavior when its value is read out by param_get_bool(), as boolean values are supposed to be '0' or '1'. Alternatively, KVM could define a custom getter for the param, but the auto value doesn't depend on the vendor module in any way, and printing "auto" would be unnecessarily unfriendly to the user. In addition to fixing the undefined behavior, resolving the auto value also fixes the scenario where the auto value resolves to N and no vendor module is loaded. Previously, -1 would result in Y being printed even though KVM would ultimately disable the mitigation. Rename the existing MMU module init/exit helpers to clarify that they're invoked with respect to the vendor module, and add comments to document why KVM has two separate "module init" flows. ========================================================================= UBSAN: invalid-load in kernel/params.c:320:33 load of value 255 is not a valid value for type '_Bool' CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 ubsan_epilogue+0x5/0x40 __ubsan_handle_load_invalid_value.cold+0x43/0x48 param_get_bool.cold+0xf/0x14 param_attr_show+0x55/0x80 module_attr_show+0x1c/0x30 sysfs_kf_seq_show+0x93/0xc0 seq_read_iter+0x11c/0x450 new_sync_read+0x11b/0x1a0 vfs_read+0xf0/0x190 ksys_read+0x5f/0xe0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ========================================================================= Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220331221359.3912754-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Add eIBRS + Retpoline optionsPeter Zijlstra
commit 1e19da8522c81bf46b335f84137165741e0d82b7 upstream. Thanks to the chaps at VUsec it is now clear that eIBRS is not sufficient, therefore allow enabling of retpolines along with eIBRS. Add spectre_v2=eibrs, spectre_v2=eibrs,lfence and spectre_v2=eibrs,retpoline options to explicitly pick your preferred means of mitigation. Since there's new mitigations there's also user visible changes in /sys/devices/system/cpu/vulnerabilities/spectre_v2 to reflect these new mitigations. [ bp: Massage commit message, trim error messages, do more precise eIBRS mode checking. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCEPeter Zijlstra (Intel)
commit d45476d9832409371537013ebdd8dc1a7781f97a upstream. The RETPOLINE_AMD name is unfortunate since it isn't necessarily AMD only, in fact Hygon also uses it. Furthermore it will likely be sufficient for some Intel processors. Therefore rename the thing to RETPOLINE_LFENCE to better describe what it is. Add the spectre_v2=retpoline,lfence option as an alias to spectre_v2=retpoline,amd to preserve existing setups. However, the output of /sys/devices/system/cpu/vulnerabilities/spectre_v2 will be changed. [ bp: Fix typos, massage. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> [fllinden@amazon.com: backported to 5.10] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-02x86/fpu: Correct pkru/xstate inconsistencyBrian Geffon
When eagerly switching PKRU in switch_fpu_finish() it checks that current is not a kernel thread as kernel threads will never use PKRU. It's possible that this_cpu_read_stable() on current_task (ie. get_current()) is returning an old cached value. To resolve this reference next_p directly rather than relying on current. As written it's possible when switching from a kernel thread to a userspace thread to observe a cached PF_KTHREAD flag and never restore the PKRU. And as a result this issue only occurs when switching from a kernel thread to a userspace thread, switching from a non kernel thread works perfectly fine because all that is considered in that situation are the flags from some other non kernel task and the next fpu is passed in to switch_fpu_finish(). This behavior only exists between 5.2 and 5.13 when it was fixed by a rewrite decoupling PKRU from xstate, in: commit 954436989cc5 ("x86/fpu: Remove PKRU handling from switch_fpu_finish()") Unfortunately backporting the fix from 5.13 is probably not realistic as it's part of a 60+ patch series which rewrites most of the PKRU handling. Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state") Signed-off-by: Brian Geffon <bgeffon@google.com> Signed-off-by: Willis Kung <williskung@google.com> Tested-by: Willis Kung <williskung@google.com> Cc: <stable@vger.kernel.org> # v5.4.x Cc: <stable@vger.kernel.org> # v5.10.x Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-05KVM: x86: Forcibly leave nested virt when SMM state is toggledSean Christopherson
commit f7e570780efc5cec9b2ed1e0472a7da14e864fdb upstream. Forcibly leave nested virtualization operation if userspace toggles SMM state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace forces the vCPU out of SMM while it's post-VMXON and then injects an SMI, vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both vmxon=false and smm.vmxon=false, but all other nVMX state allocated. Don't attempt to gracefully handle the transition as (a) most transitions are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't sufficient information to handle all transitions, e.g. SVM wants access to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede KVM_SET_NESTED_STATE during state restore as the latter disallows putting the vCPU into L2 if SMM is active, and disallows tagging the vCPU as being post-VMXON in SMM if SMM is not active. Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU in an architecturally impossible state. WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Modules linked in: CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00 Call Trace: <TASK> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123 kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline] kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460 kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline] kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline] kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250 kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273 __fput+0x286/0x9f0 fs/file_table.c:311 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0xb29/0x2a30 kernel/exit.c:806 do_group_exit+0xd2/0x2f0 kernel/exit.c:935 get_signal+0x4b0/0x28c0 kernel/signal.c:2862 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Cc: stable@vger.kernel.org Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220125220358.2091737-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Backported-by: Tadeusz Struk <tadeusz.struk@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27x86/mm: Flush global TLB when switching to trampoline page-tableJoerg Roedel
[ Upstream commit 71d5049b053876afbde6c3273250b76935494ab2 ] Move the switching code into a function so that it can be re-used and add a global TLB flush. This makes sure that usage of memory which is not mapped in the trampoline page-table is reliably caught. Also move the clearing of CR4.PCIDE before the CR3 switch because the cr4_clear_bits() function will access data not mapped into the trampoline page-table. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211202153226.22946-4-joro@8bytes.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/uaccess: Move variable into switch case statementKees Cook
[ Upstream commit 61646ca83d3889696f2772edaff122dd96a2935e ] When building with automatic stack variable initialization, GCC 12 complains about variables defined outside of switch case statements. Move the variable into the case that uses it, which silences the warning: ./arch/x86/include/asm/uaccess.h:317:23: warning: statement will never be executed [-Wswitch-unreachable] 317 | unsigned char x_u8__; \ | ^~~~~~ Fixes: 865c50e1d279 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT") Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211209043456.1377875-1-keescook@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-20KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guestSean Christopherson
commit f4b027c5c8199abd4fb6f00d67d380548dbfdfa8 upstream. Override the Processor Trace (PT) interrupt handler for guest mode if and only if PT is configured for host+guest mode, i.e. is being used independently by both host and guest. If PT is configured for system mode, the host fully controls PT and must handle all events. Fixes: 8479e04e7d6b ("KVM: x86: Inject PMI for KVM guest") Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Reported-by: Artem Kashkanov <artem.kashkanov@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211111020738.2512932-4-seanjc@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-29x86/pkey: Fix undefined behaviour with PKRU_WD_BITAndrew Cooper
commit 57690554abe135fee81d6ac33cc94d75a7e224bb upstream. Both __pkru_allows_write() and arch_set_user_pkey_access() shift PKRU_WD_BIT (a signed constant) by up to 30 bits, hitting the sign bit. Use unsigned constants instead. Clearly pkey 15 has not been used in combination with UBSAN yet. Noticed by code inspection only. I can't actually provoke the compiler into generating incorrect logic as far as this shift is concerned. [ dhansen: add stable@ tag, plus minor changelog massaging, For anyone doing backports, these #defines were in arch/x86/include/asm/pgtable.h before 784a46618f6. ] Fixes: 33a709b25a76 ("mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20211216000856.4480-1-andrew.cooper3@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-14KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush ↵Vitaly Kuznetsov
hypercall commit 1ebfaa11ebb5b603a3c3f54b2e84fcf1030f5a14 upstream. Prior to commit 0baedd792713 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()"), kvm_hv_flush_tlb() was using 'KVM_REQ_TLB_FLUSH | KVM_REQUEST_NO_WAKEUP' when making a request to flush TLBs on other vCPUs and KVM_REQ_TLB_FLUSH is/was defined as: (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) so KVM_REQUEST_WAIT was lost. Hyper-V TLFS, however, requires that "This call guarantees that by the time control returns back to the caller, the observable effects of all flushes on the specified virtual processors have occurred." and without KVM_REQUEST_WAIT there's a small chance that the vCPU making the TLB flush will resume running before all IPIs get delivered to other vCPUs and a stale mapping can get read there. Fix the issue by adding KVM_REQUEST_WAIT flag to KVM_REQ_TLB_FLUSH_GUEST: kvm_hv_flush_tlb() is the sole caller which uses it for kvm_make_all_cpus_request()/kvm_make_vcpus_request_mask() where KVM_REQUEST_WAIT makes a difference. Cc: stable@kernel.org Fixes: 0baedd792713 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211209102937.584397-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-08x86/pv: Switch SWAPGS to ALTERNATIVEJuergen Gross
[ Upstream commit 53c9d9240944088274aadbbbafc6138ca462db4f ] SWAPGS is used only for interrupts coming from user mode or for returning to user mode. So there is no reason to use the PARAVIRT framework, as it can easily be replaced by an ALTERNATIVE depending on X86_FEATURE_XENPV. There are several instances using the PV-aware SWAPGS macro in paths which are never executed in a Xen PV guest. Replace those with the plain swapgs instruction. For SWAPGS_UNSAFE_STACK the same applies. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210120135555.32594-5-jgross@suse.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-21x86/iopl: Fake iopl(3) CLI/STI usagePeter Zijlstra
commit b968e84b509da593c50dc3db679e1d33de701f78 upstream. Since commit c8137ace5638 ("x86/iopl: Restrict iopl() permission scope") it's possible to emulate iopl(3) using ioperm(), except for the CLI/STI usage. Userspace CLI/STI usage is very dubious (read broken), since any exception taken during that window can lead to rescheduling anyway (or worse). The IOPL(2) manpage even states that usage of CLI/STI is highly discouraged and might even crash the system. Of course, that won't stop people and HP has the dubious honour of being the first vendor to be found using this in their hp-health package. In order to enable this 'software' to still 'work', have the #GP treat the CLI/STI instructions as NOPs when iopl(3). Warn the user that their program is doing dubious things. Fixes: a24ca9976843 ("x86/iopl: Remove legacy IOPL option") Reported-by: Ondrej Zary <linux@zary.sk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org # v5.5+ Link: https://lkml.kernel.org/r/20210918090641.GD5106@worktop.programming.kicks-ass.net Signed-off-by: Ondrej Zary <linux@zary.sk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/sev: Make the #VC exception stacks part of the default stacks storageBorislav Petkov
commit 541ac97186d9ea88491961a46284de3603c914fd upstream. The size of the exception stacks was increased by the commit in Fixes, resulting in stack sizes greater than a page in size. The #VC exception handling was only mapping the first (bottom) page, resulting in an SEV-ES guest failing to boot. Make the #VC exception stacks part of the default exception stacks storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map them only when a SEV-ES guest has been detected. Rip out the custom VC stacks mapping and storage code. [ bp: Steal and adapt Tom's commit message. ] Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes") Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Brijesh Singh <brijesh.singh@amd.com> Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/sev: Add an x86 version of cc_platform_has()Tom Lendacky
commit aa5a461171f98fde0df78c4f6b5018a1e967cf81 upstream. Introduce an x86 version of the cc_platform_has() function. This will be used to replace vendor specific calls like sme_active(), sev_active(), etc. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210928191009.32551-4-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86: Increase exception stack sizesPeter Zijlstra
[ Upstream commit 7fae4c24a2b84a66c7be399727aca11e7a888462 ] It turns out that a single page of stack is trivial to overflow with all the tracing gunk enabled. Raise the exception stacks to 2 pages, which is still half the interrupt stacks, which are at 4 pages. Reported-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YUIO9Ye98S5Eb68w@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-13x86/entry: Correct reference to intended CONFIG_64_BITLukas Bulwahn
commit 2c861f2b859385e9eaa6e464a8a7435b5a6bf564 upstream. Commit in Fixes adds a condition with IS_ENABLED(CONFIG_64_BIT), but the intended config item is called CONFIG_64BIT, as defined in arch/x86/Kconfig. Fortunately, scripts/checkkconfigsymbols.py warns: 64_BIT Referencing files: arch/x86/include/asm/entry-common.h Correct the reference to the intended config symbol. Fixes: 662a0221893a ("x86/entry: Fix AC assertion") Suggested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20210803113531.30720-2-lukas.bulwahn@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-06KVM: x86: Handle SRCU initialization failure during page track initHaimin Zhang
commit eb7511bf9182292ef1df1082d23039e856d1ddfb upstream. Check the return of init_srcu_struct(), which can fail due to OOM, when initializing the page track mechanism. Lack of checking leads to a NULL pointer deref found by a modified syzkaller. Reported-by: TCS Robot <tcs_robot@tencent.com> Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com> Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com> [Move the call towards the beginning of kvm_arch_init_vm. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-06x86/kvmclock: Move this_cpu_pvti into kvmclock.hZelin Deng
commit ad9af930680bb396c87582edc172b3a7cf2a3fbf upstream. There're other modules might use hv_clock_per_cpu variable like ptp_kvm, so move it into kvmclock.h and export the symbol to make it visiable to other modules. Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com> Cc: <stable@vger.kernel.org> Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-30x86/asm: Fix SETZ size enqcmds() build failureKees Cook
[ Upstream commit d81ff5fe14a950f53e2833cfa196e7bb3fd5d4e3 ] When building under GCC 4.9 and 5.5: arch/x86/include/asm/special_insns.h: Assembler messages: arch/x86/include/asm/special_insns.h:286: Error: operand size mismatch for `setz' Change the type to "bool" for condition code arguments, as documented. Fixes: 7f5933f81bd8 ("x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction") Co-developed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210910223332.3224851-1-keescook@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-30x86/asm: Add a missing __iomem annotation in enqcmds()Dave Jiang
[ Upstream commit 5c99720b28381bb400d4f546734c34ddaf608761 ] Add a missing __iomem annotation to address a sparse warning. The caller is expected to pass an __iomem annotated pointer to this function. The current usages send a 64-bytes command descriptor to an MMIO location (portal) on a device for consumption. Also, from the comment in movdir64b(), which also applies to enqcmds(), @__dst must be supplied as an lvalue because this tells the compiler what the object is (its size) the instruction accesses. I.e., not the pointers but what they point to, thus the deref'ing '*'." The actual sparse warning is: drivers/dma/idxd/submit.c: note: in included file (through arch/x86/include/asm/processor.h, \ arch/x86/include/asm/timex.h, include/linux/timex.h, include/linux/time32.h, \ include/linux/time.h, include/linux/stat.h, ...): ./arch/x86/include/asm/special_insns.h:289:41: warning: incorrect type in initializer (different address spaces) ./arch/x86/include/asm/special_insns.h:289:41: expected struct <noident> *__dst ./arch/x86/include/asm/special_insns.h:289:41: got void [noderef] __iomem *dst [ bp: Massage commit message. ] Fixes: 7f5933f81bd8 ("x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ben Widawsky <ben.widawsky@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lkml.kernel.org/r/161003789741.4062451.14362269365703761223.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-22x86/uaccess: Fix 32-bit __get_user_asm_u64() when CC_HAS_ASM_GOTO_OUTPUT=yWill Deacon
commit a69ae291e1cc2d08ae77c2029579c59c9bde5061 upstream. Commit 865c50e1d279 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT") added an optimised version of __get_user_asm() for x86 using 'asm goto'. Like the non-optimised code, the 32-bit implementation of 64-bit get_user() expands to a pair of 32-bit accesses. Unlike the non-optimised code, the _original_ pointer is incremented to copy the high word instead of loading through a new pointer explicitly constructed to point at a 32-bit type. Consequently, if the pointer points at a 64-bit type then we end up loading the wrong data for the upper 32-bits. This was observed as a mount() failure in Android targeting i686 after b0cfcdd9b967 ("d_path: make 'prepend()' fill up the buffer exactly on overflow") because the call to copy_from_kernel_nofault() from prepend_copy() ends up in __get_kernel_nofault() and casts the source pointer to a 'u64 __user *'. An attempt to mount at "/debug_ramdisk" therefore ends up failing trying to mount "/debumdismdisk". Use the existing '__gu_ptr' source pointer to unsigned int for 32-bit __get_user_asm_u64() instead of the original pointer. Cc: Bill Wendling <morbo@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: 865c50e1d279 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT") Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15x86/mce: Defer processing of early errorsBorislav Petkov
[ Upstream commit 3bff147b187d5dfccfca1ee231b0761a89f1eff5 ] When a fatal machine check results in a system reset, Linux does not clear the error(s) from machine check bank(s) - hardware preserves the machine check banks across a warm reset. During initialization of the kernel after the reboot, Linux reads, logs, and clears all machine check banks. But there is a problem. In: 5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver") the call to mce_register_decode_chain() moved later in the boot sequence. This means that /dev/mcelog doesn't see those early error logs. This was partially fixed by: cd9c57cad3fe ("x86/MCE: Dump MCE to dmesg if no consumers") which made sure that the logs were not lost completely by printing to the console. But parsing console logs is error prone. Users of /dev/mcelog should expect to find any early errors logged to standard places. Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early machine check initialization to indicate that any errors found should just be queued to genpool. When mcheck_late_init() is called it will call mce_schedule_work() to actually log and flush any errors queued in the genpool. [ Based on an original patch, commit message by and completely productized by Tony Luck. ] Fixes: 5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver") Reported-by: Sumanth Kamatala <skamatala@juniper.net> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-18KVM: nSVM: avoid picking up unsupported bits from L2 in int_ctl (CVE-2021-3653)Maxim Levitsky
commit 0f923e07124df069ba68d8bb12324398f4b6b709 upstream. * Invert the mask of bits that we pick from L2 in nested_vmcb02_prepare_control * Invert and explicitly use VIRQ related bits bitmask in svm_clear_vintr This fixes a security issue that allowed a malicious L1 to run L2 with AVIC enabled, which allowed the L2 to exploit the uninitialized and enabled AVIC to read/write the host physical memory at some offsets. Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-04x86/asm: Ensure asm/proto.h can be included stand-aloneJan Kiszka
[ Upstream commit f7b21a0e41171d22296b897dac6e4c41d2a3643c ] Fix: ../arch/x86/include/asm/proto.h:14:30: warning: ‘struct task_struct’ declared \ inside parameter list will not be visible outside of this definition or declaration long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2); ^~~~~~~~~~~ .../arch/x86/include/asm/proto.h:40:34: warning: ‘struct task_struct’ declared \ inside parameter list will not be visible outside of this definition or declaration long do_arch_prctl_common(struct task_struct *task, int option, ^~~~~~~~~~~ if linux/sched.h hasn't be included previously. This fixes a build error when this header is used outside of the kernel tree. [ bp: Massage commit message. ] Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/b76b4be3-cf66-f6b2-9a6c-3e7ef54f9845@web.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20x86/fpu: Return proper error codes from user access functionsThomas Gleixner
[ Upstream commit aee8c67a4faa40a8df4e79316dbfc92d123989c1 ] When *RSTOR from user memory raises an exception, there is no way to differentiate them. That's bad because it forces the slow path even when the failure was not a fault. If the operation raised eg. #GP then going through the slow path is pointless. Use _ASM_EXTABLE_FAULT() which stores the trap number and let the exception fixup return the negated trap number as error. This allows to separate the fast path and let it handle faults directly and avoid the slow path for all other exceptions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121457.601480369@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14x86/sev: Split up runtime #VC handler for correct state trackingJoerg Roedel
[ Upstream commit be1a5408868af341f61f93c191b5e346ee88c82a ] Split up the #VC handler code into a from-user and a from-kernel part. This allows clean and correct state tracking, as the #VC handler needs to enter NMI-state when raised from kernel mode and plain IRQ state when raised from user-mode. Fixes: 62441a1fb532 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210618115409.22735-3-joro@8bytes.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14KVM: nVMX: Sync all PGDs on nested transition with shadow pagingSean Christopherson
[ Upstream commit 07ffaf343e34b555c9e7ea39a9c81c439a706f13 ] Trigger a full TLB flush on behalf of the guest on nested VM-Enter and VM-Exit when VPID is disabled for L2. kvm_mmu_new_pgd() syncs only the current PGD, which can theoretically leave stale, unsync'd entries in a previous guest PGD, which could be consumed if L2 is allowed to load CR3 with PCID_NOFLUSH=1. Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can be utilized for its obvious purpose of emulating a guest TLB flush. Note, there is no change the actual TLB flush executed by KVM, even though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT. When VPID is disabled for L2, vpid02 is guaranteed to be '0', and thus nested_get_vpid02() will return the VPID that is shared by L1 and L2. Generate the request outside of kvm_mmu_new_pgd(), as getting the common helper to correctly identify which requested is needed is quite painful. E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as a TLB flush from the L1 kernel's perspective does not invalidate EPT mappings. And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future simplification by moving the logic into nested_vmx_transition_tlb_flush(). Fixes: 41fab65e7c44 ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14sched/core: Initialize the idle task with preemption disabledValentin Schneider
[ Upstream commit f1a0a376ca0c4ef1fc3d24e3e502acbb5b795674 ] As pointed out by commit de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread") init_idle() can and will be invoked more than once on the same idle task. At boot time, it is invoked for the boot CPU thread by sched_init(). Then smp_init() creates the threads for all the secondary CPUs and invokes init_idle() on them. As the hotplug machinery brings the secondaries to life, it will issue calls to idle_thread_get(), which itself invokes init_idle() yet again. In this case it's invoked twice more per secondary: at _cpu_up(), and at bringup_cpu(). Given smp_init() already initializes the idle tasks for all *possible* CPUs, no further initialization should be required. Now, removing init_idle() from idle_thread_get() exposes some interesting expectations with regards to the idle task's preempt_count: the secondary startup always issues a preempt_disable(), requiring some reset of the preempt count to 0 between hot-unplug and hotplug, which is currently served by idle_thread_get() -> idle_init(). Given the idle task is supposed to have preemption disabled once and never see it re-enabled, it seems that what we actually want is to initialize its preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove init_idle() from idle_thread_get(). Secondary startups were patched via coccinelle: @begone@ @@ -preempt_disable(); ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-07Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack"Sean Christopherson
commit f71a53d1180d5ecc346f0c6a23191d837fe2871b upstream. Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested virtualization. When KVM (L0) is using TDP, CR4.LA57 is not reflected in mmu_role.base.level because that tracks the shadow root level, i.e. TDP level. Normally, this is not an issue because LA57 can't be toggled while long mode is active, i.e. the guest has to first disable paging, then toggle LA57, then re-enable paging, thus ensuring an MMU reinitialization. But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57 without having to bounce through an unpaged section. L1 can also load a new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a single entry PML5 is sufficient. Such shenanigans are only problematic if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode. Note, in the L2 case with nested TDP, even though L1 can switch between L2s with different LA57 settings, thus bypassing the paging requirement, in that case KVM's nested_mmu will track LA57 in base.level. This reverts commit 8053f924cad30bf9f9a24e02b6c8ddfabf5202ea. Fixes: 8053f924cad3 ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30x86/fpu: Make init_fpstate correct with optimized XSAVEThomas Gleixner
commit f9dfb5e390fab2df9f7944bb91e7705aba14cd26 upstream. The XSAVE init code initializes all enabled and supported components with XRSTOR(S) to init state. Then it XSAVEs the state of the components back into init_fpstate which is used in several places to fill in the init state of components. This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because those use the init optimization and skip writing state of components which are in init state. So init_fpstate.xsave still contains all zeroes after this operation. There are two ways to solve that: 1) Use XSAVE unconditionally, but that requires to reshuffle the buffer when XSAVES is enabled because XSAVES uses compacted format. 2) Save the components which are known to have a non-zero init state by other means. Looking deeper, #2 is the right thing to do because all components the kernel supports have all-zeroes init state except the legacy features (FP, SSE). Those cannot be hard coded because the states are not identical on all CPUs, but they can be saved with FXSAVE which avoids all conditionals. Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with a BUILD_BUG_ON() which reminds developers to validate that a newly added component has all zeroes init state. As a bonus remove the now unused copy_xregs_to_kernel_booting() crutch. The XSAVE and reshuffle method can still be implemented in the unlikely case that components are added which have a non-zero init state and no other means to save them. For now, FXSAVE is just simple and good enough. [ bp: Fix a typo or two in the text. ] Fixes: 6bad06b76892 ("x86, xsave: Use xsaveopt in context-switch path when supported") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23x86/pkru: Write hardware init value to PKRU when xstate is initThomas Gleixner
commit 510b80a6a0f1a0d114c6e33bcea64747d127973c upstream. When user space brings PKRU into init state, then the kernel handling is broken: T1 user space xsave(state) state.header.xfeatures &= ~XFEATURE_MASK_PKRU; xrstor(state) T1 -> kernel schedule() XSAVE(S) -> T1->xsave.header.xfeatures[PKRU] == 0 T1->flags |= TIF_NEED_FPU_LOAD; wrpkru(); schedule() ... pk = get_xsave_addr(&T1->fpu->state.xsave, XFEATURE_PKRU); if (pk) wrpkru(pk->pkru); else wrpkru(DEFAULT_PKRU); Because the xfeatures bit is 0 and therefore the value in the xsave storage is not valid, get_xsave_addr() returns NULL and switch_to() writes the default PKRU. -> FAIL #1! So that wrecks any copy_to/from_user() on the way back to user space which hits memory which is protected by the default PKRU value. Assumed that this does not fail (pure luck) then T1 goes back to user space and because TIF_NEED_FPU_LOAD is set it ends up in switch_fpu_return() __fpregs_load_activate() if (!fpregs_state_valid()) { load_XSTATE_from_task(); } But if nothing touched the FPU between T1 scheduling out and back in, then the fpregs_state is still valid which means switch_fpu_return() does nothing and just clears TIF_NEED_FPU_LOAD. Back to user space with DEFAULT_PKRU loaded. -> FAIL #2! The fix is simple: if get_xsave_addr() returns NULL then set the PKRU value to 0 instead of the restrictive default PKRU value in init_pkru_value. [ bp: Massage in minor nitpicks from folks. ] Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210608144346.045616965@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23x86/process: Check PF_KTHREAD and not current->mm for kernel threadsThomas Gleixner
commit 12f7764ac61200e32c916f038bdc08f884b0b604 upstream. switch_fpu_finish() checks current->mm as indicator for kernel threads. That's wrong because kernel threads can temporarily use a mm of a user process via kthread_use_mm(). Check the task flags for PF_KTHREAD instead. Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210608144345.912645927@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10x86/kvm: Disable all PV features on crashVitaly Kuznetsov
commit 3d6b84132d2a57b5a74100f6923a8feb679ac2ce upstream. Crash shutdown handler only disables kvmclock and steal time, other PV features remain active so we risk corrupting memory or getting some side-effects in kdump kernel. Move crash handler to kvm.c and unify with CPU offline. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10x86/kvm: Disable kvmclock on all CPUs on shutdownVitaly Kuznetsov
commit c02027b5742b5aa804ef08a4a9db433295533046 upstream. Currenly, we disable kvmclock from machine_shutdown() hook and this only happens for boot CPU. We need to disable it for all CPUs to guard against memory corruption e.g. on restore from hibernate. Note, writing '0' to kvmclock MSR doesn't clear memory location, it just prevents hypervisor from updating the location so for the short while after write and while CPU is still alive, the clock remains usable and correct so we don't need to switch to some other clocksource. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10x86/apic: Mark _all_ legacy interrupts when IO/APIC is missingThomas Gleixner
commit 7d65f9e80646c595e8c853640a9d0768a33e204c upstream. PIC interrupts do not support affinity setting and they can end up on any online CPU. Therefore, it's required to mark the associated vectors as system-wide reserved. Otherwise, the corresponding irq descriptors are copied to the secondary CPUs but the vectors are not marked as assigned or reserved. This works correctly for the IO/APIC case. When the IO/APIC is disabled via config, kernel command line or lack of enumeration then all legacy interrupts are routed through the PIC, but nothing marks them as system-wide reserved vectors. As a consequence, a subsequent allocation on a secondary CPU can result in allocating one of these vectors, which triggers the BUG() in apic_update_vector() because the interrupt descriptor slot is not empty. Imran tried to work around that by marking those interrupts as allocated when a CPU comes online. But that's wrong in case that the IO/APIC is available and one of the legacy interrupts, e.g. IRQ0, has been switched to PIC mode because then marking them as allocated will fail as they are already marked as system vectors. Stay consistent and update the legacy vectors after attempting IO/APIC initialization and mark them as system vectors in case that no IO/APIC is available. Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Reported-by: Imran Khan <imran.f.khan@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()Thomas Gleixner
commit 9bfecd05833918526cc7357d55e393393440c5fa upstream. While digesting the XSAVE-related horrors which got introduced with the supervisor/user split, the recent addition of ENQCMD-related functionality got on the radar and turned out to be similarly broken. update_pasid(), which is only required when X86_FEATURE_ENQCMD is available, is invoked from two places: 1) From switch_to() for the incoming task 2) Via a SMP function call from the IOMMU/SMV code #1 is half-ways correct as it hacks around the brokenness of get_xsave_addr() by enforcing the state to be 'present', but all the conditionals in that code are completely pointless for that. Also the invocation is just useless overhead because at that point it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task and all of this can be handled at return to user space. #2 is broken beyond repair. The comment in the code claims that it is safe to invoke this in an IPI, but that's just wishful thinking. FPU state of a running task is protected by fregs_lock() which is nothing else than a local_bh_disable(). As BH-disabled regions run usually with interrupts enabled the IPI can hit a code section which modifies FPU state and there is absolutely no guarantee that any of the assumptions which are made for the IPI case is true. Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is invoked with a NULL pointer argument, so it can hit a completely unrelated task and unconditionally force an update for nothing. Worse, it can hit a kernel thread which operates on a user space address space and set a random PASID for it. The offending commit does not cleanly revert, but it's sufficient to force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid() code to make this dysfunctional all over the place. Anything more complex would require more surgery and none of the related functions outside of the x86 core code are blatantly wrong, so removing those would be overkill. As nothing enables the PASID bit in the IA32_XSS MSR yet, which is required to make this actually work, this cannot result in a regression except for related out of tree train-wrecks, but they are broken already today. Fixes: 20f0afd1fb3d ("x86/mmu: Allocate/free a PASID") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-19KVM: VMX: Disable preemption when probing user return MSRsSean Christopherson
commit 5104d7ffcf24749939bea7fdb5378d186473f890 upstream. Disable preemption when probing a user return MSR via RDSMR/WRMSR. If the MSR holds a different value per logical CPU, the WRMSR could corrupt the host's value if KVM is preempted between the RDMSR and WRMSR, and then rescheduled on a different CPU. Opportunistically land the helper in common x86, SVM will use the helper in a future commit. Fixes: 4be534102624 ("KVM: VMX: Initialize vmx->guest_msrs[] right after allocation") Cc: stable@vger.kernel.org Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-6-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-19KVM/VMX: Invoke NMI non-IST entry instead of IST entryLai Jiangshan
commit a217a6593cec8b315d4c2f344bae33660b39b703 upstream. In VMX, the host NMI handler needs to be invoked after NMI VM-Exit. Before commit 1a5488ef0dcf6 ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn"), this was done by INTn ("int $2"). But INTn microcode is relatively expensive, so the commit reworked NMI VM-Exit handling to invoke the kernel handler by function call. But this missed a detail. The NMI entry point for direct invocation is fetched from the IDT table and called on the kernel stack. But on 64-bit the NMI entry installed in the IDT expects to be invoked on the IST stack. It relies on the "NMI executing" variable on the IST stack to work correctly, which is at a fixed position in the IST stack. When the entry point is unexpectedly called on the kernel stack, the RSP-addressed "NMI executing" variable is obviously also on the kernel stack and is "uninitialized" and can cause the NMI entry code to run in the wrong way. Provide a non-ist entry point for VMX which shares the C-function with the regular NMI entry and invoke the new asm entry point instead. On 32-bit this just maps to the regular NMI entry point as 32-bit has no ISTs and is not affected. [ tglx: Made it independent for backporting, massaged changelog ] Fixes: 1a5488ef0dcf6 ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Lai Jiangshan <laijs@linux.alibaba.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87r1imi8i1.ffs@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-19KVM: x86/mmu: Remove the defunct update_pte() paging hookSean Christopherson
commit c5e2184d1544f9e56140791eff1a351bea2e63b9 upstream. Remove the update_pte() shadow paging logic, which was obsoleted by commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"), but never removed. As pointed out by Yu, KVM never write protects leaf page tables for the purposes of shadow paging, and instead marks their associated shadow page as unsync so that the guest can write PTEs at will. The update_pte() path, which predates the unsync logic, optimizes COW scenarios by refreshing leaf SPTEs when they are written, as opposed to zapping the SPTE, restarting the guest, and installing the new SPTE on the subsequent fault. Since KVM no longer write-protects leaf page tables, update_pte() is unreachable and can be dropped. Reported-by: Yu Zhang <yu.c.zhang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210115004051.4099250-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-14ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=mVitaly Kuznetsov
commit fa26d0c778b432d3d9814ea82552e813b33eeb5c upstream. Commit 8cdddd182bd7 ("ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()") tried to fix CPU0 hotplug breakage by copying wakeup_cpu0() + start_cpu0() logic from hlt_play_dead()//mwait_play_dead() into acpi_idle_play_dead(). The problem is that these functions are not exported to modules so when CONFIG_ACPI_PROCESSOR=m build fails. The issue could've been fixed by exporting both wakeup_cpu0()/start_cpu0() (the later from assembly) but it seems putting the whole pattern into a new function and exporting it instead is better. Reported-by: kernel test robot <lkp@intel.com> Fixes: 8cdddd182bd7 ("CPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()") Cc: <stable@vger.kernel.org> # 5.10+ Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-07ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()Vitaly Kuznetsov
commit 8cdddd182bd7befae6af49c5fd612893f55d6ccb upstream. Commit 496121c02127 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state") broke CPU0 hotplug on certain systems, e.g. I'm observing the following on AWS Nitro (e.g r5b.xlarge but other instance types are affected as well): # echo 0 > /sys/devices/system/cpu/cpu0/online # echo 1 > /sys/devices/system/cpu/cpu0/online <10 seconds delay> -bash: echo: write error: Input/output error In fact, the above mentioned commit only revealed the problem and did not introduce it. On x86, to wakeup CPU an NMI is being used and hlt_play_dead()/mwait_play_dead() loops are prepared to handle it: /* * If NMI wants to wake up CPU0, start CPU0. */ if (wakeup_cpu0()) start_cpu0(); cpuidle_play_dead() -> acpi_idle_play_dead() (which is now being called on systems where it wasn't called before the above mentioned commit) serves the same purpose but it doesn't have a path for CPU0. What happens now on wakeup is: - NMI is sent to CPU0 - wakeup_cpu0_nmi() works as expected - we get back to while (1) loop in acpi_idle_play_dead() - safe_halt() puts CPU0 to sleep again. The straightforward/minimal fix is add the special handling for CPU0 on x86 and that's what the patch is doing. Fixes: 496121c02127 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: 5.10+ <stable@vger.kernel.org> # 5.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-30Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"Roger Pau Monne
commit af44a387e743ab7aa39d3fb5e29c0a973cf91bdc upstream. This partially reverts commit 882213990d32 ("xen: fix p2m size in dom0 for disabled memory hotplug case") There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order to correctly size the p2m. The generic memory hotplug option has already been tied together with the Xen hotplug limit, so enabling memory hotplug should already trigger a properly sized p2m on Xen PV. Note that XEN_UNPOPULATED_ALLOC depends on ZONE_DEVICE which pulls in MEMORY_HOTPLUG. Leave the check added to __set_phys_to_machine and the adjusted comment about EXTRA_MEM_RATIO. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20210324122424.58685-3-roger.pau@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [boris: fixed formatting issues] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2021-03-30KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ishSean Christopherson
[ Upstream commit b318e8decf6b9ef1bcf4ca06fae6d6a2cb5d5c5c ] Fix a plethora of issues with MSR filtering by installing the resulting filter as an atomic bundle instead of updating the live filter one range at a time. The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as the hardware MSR bitmaps won't be updated until the next VM-Enter, but the relevant software struct is atomically updated, which is what KVM really needs. Similar to the approach used for modifying memslots, make arch.msr_filter a SRCU-protected pointer, do all the work configuring the new filter outside of kvm->lock, and then acquire kvm->lock only when the new filter has been vetted and created. That way vCPU readers either see the old filter or the new filter in their entirety, not some half-baked state. Yuan Yao pointed out a use-after-free in ksm_msr_allowed() due to a TOCTOU bug, but that's just the tip of the iceberg... - Nothing is __rcu annotated, making it nigh impossible to audit the code for correctness. - kvm_add_msr_filter() has an unpaired smp_wmb(). Violation of kernel coding style aside, the lack of a smb_rmb() anywhere casts all code into doubt. - kvm_clear_msr_filter() has a double free TOCTOU bug, as it grabs count before taking the lock. - kvm_clear_msr_filter() also has memory leak due to the same TOCTOU bug. The entire approach of updating the live filter is also flawed. While installing a new filter is inherently racy if vCPUs are running, fixing the above issues also makes it trivial to ensure certain behavior is deterministic, e.g. KVM can provide deterministic behavior for MSRs with identical settings in the old and new filters. An atomic update of the filter also prevents KVM from getting into a half-baked state, e.g. if installing a filter fails, the existing approach would leave the filter in a half-baked state, having already committed whatever bits of the filter were already processed. [*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com Fixes: 1a155254ff93 ("KVM: x86: Introduce MSR filtering") Cc: stable@vger.kernel.org Cc: Alexander Graf <graf@amazon.com> Reported-by: Yuan Yao <yaoyuan0329os@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210316184436.2544875-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30static_call: Allow module use without exposing static_call_keyJosh Poimboeuf
[ Upstream commit 73f44fe19d359635a607e8e8daa0da4001c1cfc2 ] When exporting static_call_key; with EXPORT_STATIC_CALL*(), the module can use static_call_update() to change the function called. This is not desirable in general. Not exporting static_call_key however also disallows usage of static_call(), since objtool needs the key to construct the static_call_site. Solve this by allowing objtool to create the static_call_site using the trampoline address when it builds a module and cannot find the static_call_key symbol. The module loader will then try and map the trampole back to a key before it constructs the normal sites list. Doing this requires a trampoline -> key associsation, so add another magic section that keeps those. Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210127231837.ifddpn7rhwdaepiu@treble Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-25x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()Oleg Nesterov
commit 8c150ba2fb5995c84a7a43848250d444a3329a7d upstream. The comment in get_nr_restart_syscall() says: * The problem is that we can get here when ptrace pokes * syscall-like values into regs even if we're not in a syscall * at all. Yes, but if not in a syscall then the status & (TS_COMPAT|TS_I386_REGS_POKED) check below can't really help: - TS_COMPAT can't be set - TS_I386_REGS_POKED is only set if regs->orig_ax was changed by 32bit debugger; and even in this case get_nr_restart_syscall() is only correct if the tracee is 32bit too. Suppose that a 64bit debugger plays with a 32bit tracee and * Tracee calls sleep(2) // TS_COMPAT is set * User interrupts the tracee by CTRL-C after 1 sec and does "(gdb) call func()" * gdb saves the regs by PTRACE_GETREGS * does PTRACE_SETREGS to set %rip='func' and %orig_rax=-1 * PTRACE_CONT // TS_COMPAT is cleared * func() hits int3. * Debugger catches SIGTRAP. * Restore original regs by PTRACE_SETREGS. * PTRACE_CONT get_nr_restart_syscall() wrongly returns __NR_restart_syscall==219, the tracee calls ia32_sys_call_table[219] == sys_madvise. Add the sticky TS_COMPAT_RESTART flag which survives after return to user mode. It's going to be removed in the next step again by storing the information in the restart block. As a further cleanup it might be possible to remove also TS_I386_REGS_POKED with that. Test-case: $ cvs -d :pserver:anoncvs:anoncvs@sourceware.org:/cvs/systemtap co ptrace-tests $ gcc -o erestartsys-trap-debuggee ptrace-tests/tests/erestartsys-trap-debuggee.c --m32 $ gcc -o erestartsys-trap-debugger ptrace-tests/tests/erestartsys-trap-debugger.c -lutil $ ./erestartsys-trap-debugger Unexpected: retval 1, errno 22 erestartsys-trap-debugger: ptrace-tests/tests/erestartsys-trap-debugger.c:421 Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code") Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210201174709.GA17895@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-25x86: Move TS_COMPAT back to asm/thread_info.hOleg Nesterov
commit 66c1b6d74cd7035e85c426f0af4aede19e805c8a upstream. Move TS_COMPAT back to asm/thread_info.h, close to TS_I386_REGS_POKED. It was moved to asm/processor.h by b9d989c7218a ("x86/asm: Move the thread_info::status field to thread_struct"), then later 37a8f7c38339 ("x86/asm: Move 'status' from thread_struct to thread_info") moved the 'status' field back but TS_COMPAT was forgotten. Preparatory patch to fix the COMPAT case for get_nr_restart_syscall() Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210201174649.GA17880@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17x86/sev-es: Use __copy_from_user_inatomic()Joerg Roedel
commit bffe30dd9f1f3b2608a87ac909a224d6be472485 upstream. The #VC handler must run in atomic context and cannot sleep. This is a problem when it tries to fetch instruction bytes from user-space via copy_from_user(). Introduce a insn_fetch_from_user_inatomic() helper which uses __copy_from_user_inatomic() to safely copy the instruction bytes to kernel memory in the #VC handler. Fixes: 5e3427a7bc432 ("x86/sev-es: Handle instruction fetches from user-space") Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # v5.10+ Link: https://lkml.kernel.org/r/20210303141716.29223-6-joro@8bytes.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17x86/entry: Move nmi entry/exit into common codeThomas Gleixner
commit b6be002bcd1dd1dedb926abf3c90c794eacb77dc upstream. Lockdep state handling on NMI enter and exit is nothing specific to X86. It's not any different on other architectures. Also the extra state type is not necessary, irqentry_state_t can carry the necessary information as well. Move it to common code and extend irqentry_state_t to carry lockdep state. [ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive between the IRQ and NMI exceptions, and add kernel documentation for struct irqentry_state_t ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>