aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-05-28Merge tag 'v5.4.43' into v5.4/standard/baseBruce Ashfield
This is the 5.4.43 stable release # gpg: Signature made Wed 27 May 2020 11:46:53 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-05-27x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasksJosh Poimboeuf
commit 187b96db5ca79423618dfa29a05c438c34f9e1f0 upstream. Normally, show_trace_log_lvl() scans the stack, looking for text addresses to print. In parallel, it unwinds the stack with unwind_next_frame(). If the stack address matches the pointer returned by unwind_get_return_address_ptr() for the current frame, the text address is printed normally without a question mark. Otherwise it's considered a breadcrumb (potentially from a previous call path) and it's printed with a question mark to indicate that the address is unreliable and typically can be ignored. Since the following commit: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks") ... for inactive tasks, show_trace_log_lvl() prints *only* unreliable addresses (prepended with '?'). That happens because, for the first frame of an inactive task, unwind_get_return_address_ptr() returns the wrong return address pointer: one word *below* the task stack pointer. show_trace_log_lvl() starts scanning at the stack pointer itself, so it never finds the first 'reliable' address, causing only guesses to being printed. The first frame of an inactive task isn't a normal stack frame. It's actually just an instance of 'struct inactive_task_frame' which is left behind by __switch_to_asm(). Now that this inactive frame is actually exposed to callers, fix unwind_get_return_address_ptr() to interpret it properly. Fixes: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200522135435.vbxs7umku5pyrdbk@treble Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.cBabu Moger
commit 37486135d3a7b03acc7755b63627a130437f066a upstream. Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU resource isn't. It can be read with XSAVE and written with XRSTOR. So, if we don't set the guest PKRU value here(kvm_load_guest_xsave_state), the guest can read the host value. In case of kvm_load_host_xsave_state, guest with CR4.PKE clear could potentially use XRSTOR to change the host PKRU value. While at it, move pkru state save/restore to common code and the host_pkru field to kvm_vcpu_arch. This will let SVM support protection keys. Cc: stable@vger.kernel.org Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27x86/apic: Move TSC deadline timer debug printkThomas Gleixner
[ Upstream commit c84cb3735fd53c91101ccdb191f2e3331a9262cb ] Leon reported that the printk_once() in __setup_APIC_LVTT() triggers a lockdep splat due to a lock order violation between hrtimer_base::lock and console_sem, when the 'once' condition is reset via /sys/kernel/debug/clear_warn_once after boot. The initial printk cannot trigger this because that happens during boot when the local APIC timer is set up on the boot CPU. Prevent it by moving the printk to a place which is guaranteed to be only called once during boot. Mark the deadline timer check related functions and data __init while at it. Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87y2qhoshi.fsf@nanos.tec.linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27x86/mm/cpa: Flush direct map alias during cpaRick Edgecombe
[ Upstream commit ab5130186d7476dcee0d4e787d19a521ca552ce9 ] As an optimization, cpa_flush() was changed to optionally only flush the range in @cpa if it was small enough. However, this range does not include any direct map aliases changed in cpa_process_alias(). So small set_memory_() calls that touch that alias don't get the direct map changes flushed. This situation can happen when the virtual address taking variants are passed an address in vmalloc or modules space. In these cases, force a full TLB flush. Note this issue does not extend to cases where the set_memory_() calls are passed a direct map address, or page array, etc, as the primary target. In those cases the direct map would be flushed. Fixes: 935f5839827e ("x86/mm/cpa: Optimize cpa_flush_array() TLB invalidation") Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200424105343.GA20730@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27KVM: SVM: Fix potential memory leak in svm_cpu_init()Miaohe Lin
commit d80b64ff297e40c2b6f7d7abc1b3eba70d22a068 upstream. When kmalloc memory for sd->sev_vmcbs failed, we forget to free the page held by sd->save_area. Also get rid of the var r as '-ENOMEM' is actually the only possible outcome here. Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-25Merge tag 'v5.4.42' into v5.4/standard/baseBruce Ashfield
This is the 5.4.42 stable release Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-05-25Merge tag 'v5.4.41' into v5.4/standard/baseBruce Ashfield
This is the 5.4.41 stable release # gpg: Signature made Thu 14 May 2020 01:58:30 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-05-20KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mceJim Mattson
commit c4e0e4ab4cf3ec2b3f0b628ead108d677644ebd9 upstream. Bank_num is a one-based count of banks, not a zero-based index. It overflows the allocated space only when strictly greater than KVM_MAX_MCE_BANKS. Fixes: a9e38c3e01ad ("KVM: x86: Catch potential overrun in MCE setup") Signed-off-by: Jue Wang <juew@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Message-Id: <20200511225616.19557-1-jmattson@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20x86/unwind/orc: Fix error handling in __unwind_start()Josh Poimboeuf
commit 71c95825289f585014fe9741b051d32a7a916680 upstream. The unwind_state 'error' field is used to inform the reliable unwinding code that the stack trace can't be trusted. Set this field for all errors in __unwind_start(). Also, move the zeroing out of the unwind_state struct to before the ORC table initialization check, to prevent the caller from reading uninitialized data if the ORC table is corrupted. Fixes: af085d9084b4 ("stacktrace/x86: add function for detecting reliable stack traces") Fixes: d3a09104018c ("x86/unwinder/orc: Dont bail on stack overflow") Fixes: 98d0c8ebf77e ("x86/unwind/orc: Prevent unwinding before ORC initialization") Reported-by: Pavel Machek <pavel@denx.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/d6ac7215a84ca92b895fdd2e1aa546729417e6e6.1589487277.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20x86: Fix early boot crash on gcc-10, third tryBorislav Petkov
commit a9a3ed1eff3601b63aea4fb462d8b3b92c7c1e7e upstream. ... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value. The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10: Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013 Call Trace: dump_stack panic ? start_secondary __stack_chk_fail start_secondary secondary_startup_64 -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call. To fix that, the initial attempt was to mark the one function which generates the stack canary with: __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused) however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer and thus leading to not present frame pointer - frame pointer which the kernel needs. The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm(""). This current solution was short and sweet, and reportedly, is supported by both compilers but we didn't get very far this time: future (LTO?) optimization passes could potentially eliminate this, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around etc. That should hold for a long time, but hey we said that about the other two solutions too so... Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kalle Valo <kvalo@codeaurora.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()Janakarajan Natarajan
commit 996ed22c7a5251d76dcdfe5026ef8230e90066d9 upstream. When trying to lock read-only pages, sev_pin_memory() fails because FOLL_WRITE is used as the flag for get_user_pages_fast(). Commit 73b0140bf0fe ("mm/gup: change GUP fast to use flags rather than a write 'bool'") updated the get_user_pages_fast() call sites to use flags, but incorrectly updated the call in sev_pin_memory(). As the original coding of this call was correct, revert the change made by that commit. Fixes: 73b0140bf0fe ("mm/gup: change GUP fast to use flags rather than a write 'bool'") Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Mike Marshall <hubcap@omnibond.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14KVM: x86: Fixes posted interrupt check for IRQs delivery modesSuravee Suthikulpanit
commit 637543a8d61c6afe4e9be64bfb43c78701a83375 upstream. Current logic incorrectly uses the enum ioapic_irq_destination_types to check the posted interrupt destination types. However, the value was set using APIC_DM_XXX macros, which are left-shifted by 8 bits. Fixes by using the APIC_DM_FIXED and APIC_DM_LOWEST instead. Fixes: (fdcf75621375 'KVM: x86: Disable posted interrupts for non-standard IRQs delivery modes') Cc: Alexander Graf <graf@amazon.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1586239989-58305-1-git-send-email-suravee.suthikulpanit@amd.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/unwind/orc: Fix premature unwind stoppage due to IRET framesJosh Poimboeuf
commit 81b67439d147677d844d492fcbd03712ea438f42 upstream. The following execution path is possible: fsnotify() [ realign the stack and store previous SP in R10 ] <IRQ> [ only IRET regs saved ] common_interrupt() interrupt_entry() <NMI> [ full pt_regs saved ] ... [ unwind stack ] When the unwinder goes through the NMI and the IRQ on the stack, and then sees fsnotify(), it doesn't have access to the value of R10, because it only has the five IRET registers. So the unwind stops prematurely. However, because the interrupt_entry() code is careful not to clobber R10 before saving the full regs, the unwinder should be able to read R10 from the previously saved full pt_regs associated with the NMI. Handle this case properly. When encountering an IRET regs frame immediately after a full pt_regs frame, use the pt_regs as a backup which can be used to get the C register values. Also, note that a call frame resets the 'prev_regs' value, because a function is free to clobber the registers. For this fix to work, the IRET and full regs frames must be adjacent, with no FUNC frames in between. So replace the FUNC hint in interrupt_entry() with an IRET_REGS hint. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/97a408167cc09f1cfa0de31a7b70dd88868d743f.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/unwind/orc: Fix error path for bad ORC entry typeJosh Poimboeuf
commit a0f81bf26888048100bf017fadf438a5bdffa8d8 upstream. If the ORC entry type is unknown, nothing else can be done other than reporting an error. Exit the function instead of breaking out of the switch statement. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/a7fa668ca6eabbe81ab18b2424f15adbbfdc810a.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/unwind/orc: Prevent unwinding before ORC initializationJosh Poimboeuf
commit 98d0c8ebf77e0ba7c54a9ae05ea588f0e9e3f46e upstream. If the unwinder is called before the ORC data has been initialized, orc_find() returns NULL, and it tries to fall back to using frame pointers. This can cause some unexpected warnings during boot. Move the 'orc_init' check from orc_find() to __unwind_init(), so that it doesn't even try to unwind from an uninitialized state. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/069d1499ad606d85532eb32ce39b2441679667d5.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/unwind/orc: Don't skip the first frame for inactive tasksMiroslav Benes
commit f1d9a2abff66aa8156fbc1493abed468db63ea48 upstream. When unwinding an inactive task, the ORC unwinder skips the first frame by default. If both the 'regs' and 'first_frame' parameters of unwind_start() are NULL, 'state->sp' and 'first_frame' are later initialized to the same value for an inactive task. Given there is a "less than or equal to" comparison used at the end of __unwind_start() for skipping stack frames, the first frame is skipped. Drop the equal part of the comparison and make the behavior equivalent to the frame pointer unwinder. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/7f08db872ab59e807016910acdbe82f744de7065.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/entry/64: Fix unwind hints in rewind_stack_do_exit()Jann Horn
commit f977df7b7ca45a4ac4b66d30a8931d0434c394b1 upstream. The LEAQ instruction in rewind_stack_do_exit() moves the stack pointer directly below the pt_regs at the top of the task stack before calling do_exit(). Tell the unwinder to expect pt_regs. Fixes: 8c1f75587a18 ("x86/entry/64: Add unwind hint annotations") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/68c33e17ae5963854916a46f522624f8e1d264f2.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/entry/64: Fix unwind hints in kernel exit pathJosh Poimboeuf
commit 1fb143634a38095b641a3a21220774799772dc4c upstream. In swapgs_restore_regs_and_return_to_usermode, after the stack is switched to the trampoline stack, the existing UNWIND_HINT_REGS hint is no longer valid, which can result in the following ORC unwinder warning: WARNING: can't dereference registers at 000000003aeb0cdd for ip swapgs_restore_regs_and_return_to_usermode+0x93/0xa0 For full correctness, we could try to add complicated unwind hints so the unwinder could continue to find the registers, but when when it's this close to kernel exit, unwind hints aren't really needed anymore and it's fine to just use an empty hint which tells the unwinder to stop. For consistency, also move the UNWIND_HINT_EMPTY in entry_SYSCALL_64_after_hwframe to a similar location. Fixes: 3e3b9293d392 ("x86/entry/64: Return to userspace from the trampoline stack") Reported-by: Vince Weaver <vincent.weaver@maine.edu> Reported-by: Dave Jones <dsj@fb.com> Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reported-by: Joe Mario <jmario@redhat.com> Reported-by: Jann Horn <jannh@google.com> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/60ea8f562987ed2d9ace2977502fe481c0d7c9a0.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/entry/64: Fix unwind hints in register clearing codeJosh Poimboeuf
commit 06a9750edcffa808494d56da939085c35904e618 upstream. The PUSH_AND_CLEAR_REGS macro zeroes each register immediately after pushing it. If an NMI or exception hits after a register is cleared, but before the UNWIND_HINT_REGS annotation, the ORC unwinder will wrongly think the previous value of the register was zero. This can confuse the unwinding process and cause it to exit early. Because ORC is simpler than DWARF, there are a limited number of unwind annotation states, so it's not possible to add an individual unwind hint after each push/clear combination. Instead, the register clearing instructions need to be consolidated and moved to after the UNWIND_HINT_REGS annotation. Fixes: 3f01daecd545 ("x86/entry/64: Introduce the PUSH_AND_CLEAN_REGS macro") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/68fd3d0bc92ae2d62ff7879d15d3684217d51f08.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB pathSean Christopherson
commit c7cb2d650c9e78c03bd2d1c0db89891825f8c0f4 upstream. Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so that KVM doesn't interpret clobbered RFLAGS as a VM-Fail. Filling the RSB has always clobbered RFLAGS, its current incarnation just happens clear CF and ZF in the processs. Relying on the macro to clear CF and ZF is extremely fragile, e.g. commit 089dd8e53126e ("x86/speculation: Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such that the ZF flag is always set. Reported-by: Qian Cai <cai@lca.pw> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: stable@vger.kernel.org Fixes: f2fde6a5bcfcf ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506035355.2242-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14crypto: arch/nhpoly1305 - process in explicit 4k chunksJason A. Donenfeld
commit a9a8ba90fa5857c2c8a0e32eef2159cec717da11 upstream. Rather than chunking via PAGE_SIZE, this commit changes the arch implementations to chunk in explicit 4k parts, so that calculations on maximum acceptable latency don't suddenly become invalid on platforms where PAGE_SIZE isn't 4k, such as arm64. Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305") Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305") Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305") Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-11Merge tag 'v5.4.40' into v5.4/standard/baseBruce Ashfield
This is the 5.4.40 stable release # gpg: Signature made Sun 10 May 2020 04:31:34 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-05-10x86/kvm: fix a missing-prototypes "vmread_error"Qian Cai
commit 514ccc194971d0649e4e7ec8a9b3a6e33561d7bf upstream. The commit 842f4be95899 ("KVM: VMX: Add a trampoline to fix VMREAD error handling") removed the declaration of vmread_error() causes a W=1 build failure with KVM_WERROR=y. Fix it by adding it back. arch/x86/kvm/vmx/vmx.c:359:17: error: no previous prototype for 'vmread_error' [-Werror=missing-prototypes] asmlinkage void vmread_error(unsigned long field, bool fault) ^~~~~~~~~~~~ Signed-off-by: Qian Cai <cai@lca.pw> Message-Id: <20200402153955.1695-1-cai@lca.pw> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02Merge tag 'v5.4.37' into v5.4/standard/baseBruce Ashfield
This is the 5.4.37 stable release # gpg: Signature made Sat 02 May 2020 02:49:34 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-05-02Merge tag 'v5.4.36' into v5.4/standard/baseBruce Ashfield
This is the 5.4.36 stable release # gpg: Signature made Wed 29 Apr 2020 10:33:25 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-05-02Merge tag 'v5.4.35' into v5.4/standard/baseBruce Ashfield
This is the 5.4.35 stable release # gpg: Signature made Thu 23 Apr 2020 04:36:46 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-05-02x86: hyperv: report value of misc_featuresOlaf Hering
[ Upstream commit 97d9f1c43bedd400301d6f1eff54d46e8c636e47 ] A few kernel features depend on ms_hyperv.misc_features, but unlike its siblings ->features and ->hints, the value was never reported during boot. Signed-off-by: Olaf Hering <olaf@aepfle.de> Link: https://lore.kernel.org/r/20200407172739.31371-1-olaf@aepfle.de Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-02bpf, x86: Fix encoding for lower 8-bit registers in BPF_STX BPF_BLuke Nelson
[ Upstream commit aee194b14dd2b2bde6252b3acf57d36dccfc743a ] This patch fixes an encoding bug in emit_stx for BPF_B when the source register is BPF_REG_FP. The current implementation for BPF_STX BPF_B in emit_stx saves one REX byte when the operands can be encoded using Mod-R/M alone. The lower 8 bits of registers %rax, %rbx, %rcx, and %rdx can be accessed without using a REX prefix via %al, %bl, %cl, and %dl, respectively. Other registers, (e.g., %rsi, %rdi, %rbp, %rsp) require a REX prefix to use their 8-bit equivalents (%sil, %dil, %bpl, %spl). The current code checks if the source for BPF_STX BPF_B is BPF_REG_1 or BPF_REG_2 (which map to %rdi and %rsi), in which case it emits the required REX prefix. However, it misses the case when the source is BPF_REG_FP (mapped to %rbp). The result is that BPF_STX BPF_B with BPF_REG_FP as the source operand will read from register %ch instead of the correct %bpl. This patch fixes the problem by fixing and refactoring the check on which registers need the extra REX byte. Since no BPF registers map to %rsp, there is no need to handle %spl. Fixes: 622582786c9e0 ("net: filter: x86: internal BPF JIT") Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200418232655.23870-1-luke.r.nels@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-02bpf, x86_32: Fix logic error in BPF_LDX zero-extensionWang YanQing
commit 5ca1ca01fae1e90f8d010eb1d83374f28dc11ee6 upstream. When verifier_zext is true, we don't need to emit code for zero-extension. Fixes: 836256bf5f37 ("x32: bpf: eliminate zero extension code-gen") Signed-off-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200423050637.GA4029@udknight Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02bpf, x86_32: Fix clobbering of dst for BPF_JSETLuke Nelson
commit 50fe7ebb6475711c15b3397467e6424e20026d94 upstream. The current JIT clobbers the destination register for BPF_JSET BPF_X and BPF_K by using "and" and "or" instructions. This is fine when the destination register is a temporary loaded from a register stored on the stack but not otherwise. This patch fixes the problem (for both BPF_K and BPF_X) by always loading the destination register into temporaries since BPF_JSET should not modify the destination register. This bug may not be currently triggerable as BPF_REG_AX is the only register not stored on the stack and the verifier uses it in a limited way. Fixes: 03f5781be2c7b ("bpf, x86_32: add eBPF JIT compiler for ia32") Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Wang YanQing <udknight@gmail.com> Link: https://lore.kernel.org/bpf/20200422173630.8351-2-luke.r.nels@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extensionLuke Nelson
commit 5fa9a98fb10380e48a398998cd36a85e4ef711d6 upstream. The current JIT uses the following sequence to zero-extend into the upper 32 bits of the destination register for BPF_LDX BPF_{B,H,W}, when the destination register is not on the stack: EMIT3(0xC7, add_1reg(0xC0, dst_hi), 0); The problem is that C7 /0 encodes a MOV instruction that requires a 4-byte immediate; the current code emits only 1 byte of the immediate. This means that the first 3 bytes of the next instruction will be treated as the rest of the immediate, breaking the stream of instructions. This patch fixes the problem by instead emitting "xor dst_hi,dst_hi" to clear the upper 32 bits. This fixes the problem and is more efficient than using MOV to load a zero immediate. This bug may not be currently triggerable as BPF_REG_AX is the only register not stored on the stack and the verifier uses it in a limited way, and the verifier implements a zero-extension optimization. But the JIT should avoid emitting incorrect encodings regardless. Fixes: 03f5781be2c7b ("bpf, x86_32: add eBPF JIT compiler for ia32") Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Acked-by: Wang YanQing <udknight@gmail.com> Link: https://lore.kernel.org/bpf/20200422173630.8351-1-luke.r.nels@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-29KVM: VMX: Enable machine check support for 32bit targetsUros Bizjak
commit fb56baae5ea509e63c2a068d66a4d8ea91969fca upstream. There is no reason to limit the use of do_machine_check to 64bit targets. MCE handling works for both target familes. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Fixes: a0861c02a981 ("KVM: Add VT-x machine check support") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200414071414.45636-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-23x86: ACPI: fix CPU hotplug deadlockQian Cai
[ Upstream commit 696ac2e3bf267f5a2b2ed7d34e64131f2287d0ad ] Similar to commit 0266d81e9bf5 ("acpi/processor: Prevent cpu hotplug deadlock") except this is for acpi_processor_ffh_cstate_probe(): "The problem is that the work is scheduled on the current CPU from the hotplug thread associated with that CPU. It's not required to invoke these functions via the workqueue because the hotplug thread runs on the target CPU already. Check whether current is a per cpu thread pinned on the target CPU and invoke the function directly to avoid the workqueue." WARNING: possible circular locking dependency detected ------------------------------------------------------ cpuhp/1/15 is trying to acquire lock: ffffc90003447a28 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x4c6/0x630 but task is already holding lock: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x3e/0xc0 irq_calc_affinity_vectors+0x5f/0x91 __pci_enable_msix_range+0x10f/0x9a0 pci_alloc_irq_vectors_affinity+0x13e/0x1f0 pci_alloc_irq_vectors_affinity at drivers/pci/msi.c:1208 pqi_ctrl_init+0x72f/0x1618 [smartpqi] pqi_pci_probe.cold.63+0x882/0x892 [smartpqi] local_pci_probe+0x7a/0xc0 work_for_cpu_fn+0x2e/0x50 process_one_work+0x57e/0xb90 worker_thread+0x363/0x5b0 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __lock_acquire+0x2244/0x32a0 lock_acquire+0x1a2/0x680 __flush_work+0x4e6/0x630 work_on_cpu+0x114/0x160 acpi_processor_ffh_cstate_probe+0x129/0x250 acpi_processor_evaluate_cst+0x4c8/0x580 acpi_processor_get_power_info+0x86/0x740 acpi_processor_hotplug+0xc3/0x140 acpi_soft_cpu_online+0x102/0x1d0 cpuhp_invoke_callback+0x197/0x1120 cpuhp_thread_fun+0x252/0x2f0 smpboot_thread_fn+0x255/0x440 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 other info that might help us debug this: Chain exists of: (work_completion)(&wfc.work) --> cpuhp_state-up --> cpuidle_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpuidle_lock); lock(cpuhp_state-up); lock(cpuidle_lock); lock((work_completion)(&wfc.work)); *** DEADLOCK *** 3 locks held by cpuhp/1/15: #0: ffffffffaf51ab10 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0 #1: ffffffffaf51ad40 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0 #2: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20 Call Trace: dump_stack+0xa0/0xea print_circular_bug.cold.52+0x147/0x14c check_noncircular+0x295/0x2d0 __lock_acquire+0x2244/0x32a0 lock_acquire+0x1a2/0x680 __flush_work+0x4e6/0x630 work_on_cpu+0x114/0x160 acpi_processor_ffh_cstate_probe+0x129/0x250 acpi_processor_evaluate_cst+0x4c8/0x580 acpi_processor_get_power_info+0x86/0x740 acpi_processor_hotplug+0xc3/0x140 acpi_soft_cpu_online+0x102/0x1d0 cpuhp_invoke_callback+0x197/0x1120 cpuhp_thread_fun+0x252/0x2f0 smpboot_thread_fn+0x255/0x440 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 Signed-off-by: Qian Cai <cai@lca.pw> Tested-by: Borislav Petkov <bp@suse.de> [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23x86/Hyper-V: Report crash data in die() when panic_on_oops is setTianyu Lan
commit f3a99e761efa616028b255b4de58e9b5b87c5545 upstream. When oops happens with panic_on_oops unset, the oops thread is killed by die() and system continues to run. In such case, guest should not report crash register data to host since system still runs. Check panic_on_oops and return directly in hyperv_report_panic() when the function is called in the die() and panic_on_oops is unset. Fix it. Fixes: 7ed4325a44ea ("Drivers: hv: vmbus: Make panic reporting to be more useful") Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20200406155331.2105-7-Tianyu.Lan@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-23x86/Hyper-V: Report crash register data or kmsg before running crash kernelTianyu Lan
commit a11589563e96bf262767294b89b25a9d44e7303b upstream. We want to notify Hyper-V when a Linux guest VM crash occurs, so there is a record of the crash even when kdump is enabled. But crash_kexec_post_notifiers defaults to "false", so the kdump kernel runs before the notifiers and Hyper-V never gets notified. Fix this by always setting crash_kexec_post_notifiers to be true for Hyper-V VMs. Fixes: 81b18bce48af ("Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic") Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Link: https://lore.kernel.org/r/20200406155331.2105-5-Tianyu.Lan@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-22Merge tag 'v5.4.34' into v5.4/standard/baseBruce Ashfield
This is the 5.4.34 stable release # gpg: Signature made Tue 21 Apr 2020 03:05:05 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-04-22Merge tag 'v5.4.33' into v5.4/standard/baseBruce Ashfield
This is the 5.4.33 stable release # gpg: Signature made Fri 17 Apr 2020 04:50:26 AM EDT # gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E # gpg: Can't check signature: No public key
2020-04-21x86/microcode/AMD: Increase microcode PATCH_MAX_SIZEJohn Allen
commit bdf89df3c54518eed879d8fac7577fcfb220c67e upstream. Future AMD CPUs will have microcode patches that exceed the default 4K patch size. Raise our limit. Signed-off-by: John Allen <john.allen@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # v4.14.. Link: https://lkml.kernel.org/r/20200409152931.GA685273@mojo.amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21x86/resctrl: Fix invalid attempt at removing the default resource groupReinette Chatre
commit b0151da52a6d4f3951ea24c083e7a95977621436 upstream. The default resource group ("rdtgroup_default") is associated with the root of the resctrl filesystem and should never be removed. New resource groups can be created as subdirectories of the resctrl filesystem and they can be removed from user space. There exists a safeguard in the directory removal code (rdtgroup_rmdir()) that ensures that only subdirectories can be removed by testing that the directory to be removed has to be a child of the root directory. A possible deadlock was recently fixed with 334b0f4e9b1b ("x86/resctrl: Fix a deadlock due to inaccurate reference"). This fix involved associating the private data of the "mon_groups" and "mon_data" directories to the resource group to which they belong instead of NULL as before. A consequence of this change was that the original safeguard code preventing removal of "mon_groups" and "mon_data" found in the root directory failed resulting in attempts to remove the default resource group that ends in a BUG: kernel BUG at mm/slub.c:3969! invalid opcode: 0000 [#1] SMP PTI Call Trace: rdtgroup_rmdir+0x16b/0x2c0 kernfs_iop_rmdir+0x5c/0x90 vfs_rmdir+0x7a/0x160 do_rmdir+0x17d/0x1e0 do_syscall_64+0x55/0x1d0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by improving the directory removal safeguard to ensure that subdirectories of the resctrl root directory can only be removed if they are a child of the resctrl filesystem's root _and_ not associated with the default resource group. Fixes: 334b0f4e9b1b ("x86/resctrl: Fix a deadlock due to inaccurate reference") Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/884cbe1773496b5dbec1b6bd11bb50cffa83603d.1584461853.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21x86/resctrl: Preserve CDP enable over CPU hotplugJames Morse
commit 9fe0450785abbc04b0ed5d3cf61fcdb8ab656b4b upstream. Resctrl assumes that all CPUs are online when the filesystem is mounted, and that CPUs remember their CDP-enabled state over CPU hotplug. This goes wrong when resctrl's CDP-enabled state changes while all the CPUs in a domain are offline. When a domain comes online, enable (or disable!) CDP to match resctrl's current setting. Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support") Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200221162105.154163-1-james.morse@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17efi/x86: Fix the deletion of variables in mixed modeGary Lin
[ Upstream commit a4b81ccfd4caba017d2b84720b6de4edd16911a0 ] efi_thunk_set_variable() treated the NULL "data" pointer as an invalid parameter, and this broke the deletion of variables in mixed mode. This commit fixes the check of data so that the userspace program can delete a variable in mixed mode. Fixes: 8319e9d5ad98ffcc ("efi/x86: Handle by-ref arguments covering multiple pages in mixed mode") Signed-off-by: Gary Lin <glin@suse.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200408081606.1504-1-glin@suse.com Link: https://lore.kernel.org/r/20200409130434.6736-9-ardb@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17KVM: VMX: fix crash cleanup when KVM wasn't usedVitaly Kuznetsov
commit dbef2808af6c594922fe32833b30f55f35e9da6d upstream. If KVM wasn't used at all before we crash the cleanup procedure fails with BUG: unable to handle page fault for address: ffffffffffffffc8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 23215067 P4D 23215067 PUD 23217067 PMD 0 Oops: 0000 [#8] SMP PTI CPU: 0 PID: 3542 Comm: bash Kdump: loaded Tainted: G D 5.6.0-rc2+ #823 RIP: 0010:crash_vmclear_local_loaded_vmcss.cold+0x19/0x51 [kvm_intel] The root cause is that loaded_vmcss_on_cpu list is not yet initialized, we initialize it in hardware_enable() but this only happens when we start a VM. Previously, we used to have a bitmap with enabled CPUs and that was preventing [masking] the issue. Initialized loaded_vmcss_on_cpu list earlier, right before we assign crash_vmclear_loaded_vmcss pointer. blocked_vcpu_on_cpu list and blocked_vcpu_on_cpu_lock are moved altogether for consistency. Fixes: 31603d4fc2bb ("KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200401081348.1345307-1-vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: VMX: Add a trampoline to fix VMREAD error handlingSean Christopherson
commit 842f4be95899df22b5843ba1a7c8cf37e831a6e8 upstream. Add a hand coded assembly trampoline to preserve volatile registers across vmread_error(), and to handle the calling convention differences between 64-bit and 32-bit due to asmlinkage on vmread_error(). Pass @field and @fault on the stack when invoking the trampoline to avoid clobbering volatile registers in the context of the inline assembly. Calling vmread_error() directly from inline assembly is partially broken on 64-bit, and completely broken on 32-bit. On 64-bit, it will clobber %rdi and %rsi (used to pass @field and @fault) and any volatile regs written by vmread_error(). On 32-bit, asmlinkage means vmread_error() expects the parameters to be passed on the stack, not via regs. Opportunistically zero out the result in the trampoline to save a few bytes of code for every VMREAD. A happy side effect of the trampoline is that the inline code footprint is reduced by three bytes on 64-bit due to PUSH/POP being more efficent (in terms of opcode bytes) than MOV. Fixes: 6e2020977e3e6 ("KVM: VMX: Add error handling to VMREAD helper") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200326160712.28803-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: x86: Gracefully handle __vmalloc() failure during VM allocationSean Christopherson
commit d18b2f43b9147c8005ae0844fb445d8cc6a87e31 upstream. Check the result of __vmalloc() to avoid dereferencing a NULL pointer in the event that allocation failres. Fixes: d1e5b0e98ea27 ("kvm: Make VM ioctl do valloc for some archs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec supportSean Christopherson
commit 31603d4fc2bb4f0815245d496cb970b27b4f636a upstream. VMCLEAR all in-use VMCSes during a crash, even if kdump's NMI shootdown interrupted a KVM update of the percpu in-use VMCS list. Because NMIs are not blocked by disabling IRQs, it's possible that crash_vmclear_local_loaded_vmcss() could be called while the percpu list of VMCSes is being modified, e.g. in the middle of list_add() in vmx_vcpu_load_vmcs(). This potential corner case was called out in the original commit[*], but the analysis of its impact was wrong. Skipping the VMCLEARs is wrong because it all but guarantees that a loaded, and therefore cached, VMCS will live across kexec and corrupt memory in the new kernel. Corruption will occur because the CPU's VMCS cache is non-coherent, i.e. not snooped, and so the writeback of VMCS memory on its eviction will overwrite random memory in the new kernel. The VMCS will live because the NMI shootdown also disables VMX, i.e. the in-progress VMCLEAR will #UD, and existing Intel CPUs do not flush the VMCS cache on VMXOFF. Furthermore, interrupting list_add() and list_del() is safe due to crash_vmclear_local_loaded_vmcss() using forward iteration. list_add() ensures the new entry is not visible to forward iteration unless the entire add completes, via WRITE_ONCE(prev->next, new). A bad "prev" pointer could be observed if the NMI shootdown interrupted list_del() or list_add(), but list_for_each_entry() does not consume ->prev. In addition to removing the temporary disabling of VMCLEAR, open code loaded_vmcs_init() in __loaded_vmcs_clear() and reorder VMCLEAR so that the VMCS is deleted from the list only after it's been VMCLEAR'd. Deleting the VMCS before VMCLEAR would allow a race where the NMI shootdown could arrive between list_del() and vmcs_clear() and thus neither flow would execute a successful VMCLEAR. Alternatively, more code could be moved into loaded_vmcs_init(), but that gets rather silly as the only other user, alloc_loaded_vmcs(), doesn't need the smp_wmb() and would need to work around the list_del(). Update the smp_*() comments related to the list manipulation, and opportunistically reword them to improve clarity. [*] https://patchwork.kernel.org/patch/1675731/#3720461 Fixes: 8f536b7697a0 ("KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200321193751.24985-2-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: x86: Allocate new rmap and large page tracking when moving memslotSean Christopherson
commit edd4fa37baa6ee8e44dc65523b27bd6fe44c94de upstream. Reallocate a rmap array and recalcuate large page compatibility when moving an existing memslot to correctly handle the alignment properties of the new memslot. The number of rmap entries required at each level is dependent on the alignment of the memslot's base gfn with respect to that level, e.g. moving a large-page aligned memslot so that it becomes unaligned will increase the number of rmap entries needed at the now unaligned level. Not updating the rmap array is the most obvious bug, as KVM accesses garbage data beyond the end of the rmap. KVM interprets the bad data as pointers, leading to non-canonical #GPs, unexpected #PFs, etc... general protection fault: 0000 [#1] SMP CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:rmap_get_first+0x37/0x50 [kvm] Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3 RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246 RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012 RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570 RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8 R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000 R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8 FS: 00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0 Call Trace: kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm] __kvm_set_memory_region.part.64+0x559/0x960 [kvm] kvm_set_memory_region+0x45/0x60 [kvm] kvm_vm_ioctl+0x30f/0x920 [kvm] do_vfs_ioctl+0xa1/0x620 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4c/0x170 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f7ed9911f47 Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47 RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004 RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700 R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000 R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000 Modules linked in: kvm_intel kvm irqbypass ---[ end trace 0c5f570b3358ca89 ]--- The disallow_lpage tracking is more subtle. Failure to update results in KVM creating large pages when it shouldn't, either due to stale data or again due to indexing beyond the end of the metadata arrays, which can lead to memory corruption and/or leaking data to guest/userspace. Note, the arrays for the old memslot are freed by the unconditional call to kvm_free_memslot() in __kvm_set_memory_region(). Fixes: 05da45583de9b ("KVM: MMU: large page support") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17KVM: nVMX: Properly handle userspace interrupt window requestSean Christopherson
commit a1c77abb8d93381e25a8d2df3a917388244ba776 upstream. Return true for vmx_interrupt_allowed() if the vCPU is in L2 and L1 has external interrupt exiting enabled. IRQs are never blocked in hardware if the CPU is in the guest (L2 from L1's perspective) when IRQs trigger VM-Exit. The new check percolates up to kvm_vcpu_ready_for_interrupt_injection() and thus vcpu_run(), and so KVM will exit to userspace if userspace has requested an interrupt window (to inject an IRQ into L1). Remove the @external_intr param from vmx_check_nested_events(), which is actually an indicator that userspace wants an interrupt window, e.g. it's named @req_int_win further up the stack. Injecting a VM-Exit into L1 to try and bounce out to L0 userspace is all kinds of broken and is no longer necessary. Remove the hack in nested_vmx_vmexit() that attempted to workaround the breakage in vmx_check_nested_events() by only filling interrupt info if there's an actual interrupt pending. The hack actually made things worse because it caused KVM to _never_ fill interrupt info when the LAPIC resides in userspace (kvm_cpu_has_interrupt() queries interrupt.injected, which is always cleared by prepare_vmcs12() before reaching the hack in nested_vmx_vmexit()). Fixes: 6550c4df7e50 ("KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"") Cc: stable@vger.kernel.org Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17x86/entry/32: Add missing ASM_CLAC to general_protection entryThomas Gleixner
commit 3d51507f29f2153a658df4a0674ec5b592b62085 upstream. All exception entry points must have ASM_CLAC right at the beginning. The general_protection entry is missing one. Fixes: e59d1b0a2419 ("x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200225220216.219537887@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17x86/tsc_msr: Make MSR derived TSC frequency more accurateHans de Goede
commit fac01d11722c92a186b27ee26cd429a8066adfb5 upstream. The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 4: Model-Specific Registers" has the following table for the values from freq_desc_byt: 000B: 083.3 MHz 001B: 100.0 MHz 010B: 133.3 MHz 011B: 116.7 MHz 100B: 080.0 MHz Notice how for e.g the 83.3 MHz value there are 3 significant digits, which translates to an accuracy of a 1000 ppm, where as a typical crystal oscillator is 20 - 100 ppm, so the accuracy of the frequency format used in the Software Developer’s Manual is not really helpful. As far as we know Bay Trail SoCs use a 25 MHz crystal and Cherry Trail uses a 19.2 MHz crystal, the crystal is the source clock for a root PLL which outputs 1600 and 100 MHz. It is unclear if the root PLL outputs are used directly by the CPU clock PLL or if there is another PLL in between. This does not matter though, we can model the chain of PLLs as a single PLL with a quotient equal to the quotients of all PLLs in the chain multiplied. So we can create a simplified model of the CPU clock setup using a reference clock of 100 MHz plus a quotient which gets us as close to the frequency from the SDM as possible. For the 83.3 MHz example from above this would give 100 MHz * 5 / 6 = 83 and 1/3 MHz, which matches exactly what has been measured on actual hardware. Use a simplified PLL model with a reference clock of 100 MHz for all Bay and Cherry Trail models. This has been tested on the following models: CPU freq before: CPU freq after: Intel N2840 2165.800 MHz 2166.667 MHz Intel Z3736 1332.800 MHz 1333.333 MHz Intel Z3775 1466.300 MHz 1466.667 MHz Intel Z8350 1440.000 MHz 1440.000 MHz Intel Z8750 1600.000 MHz 1600.000 MHz This fixes the time drifting by about 1 second per hour (20 - 30 seconds per day) on (some) devices which rely on the tsc_msr.c code to determine the TSC frequency. Reported-by: Vipul Kumar <vipulk0511@gmail.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200223140610.59612-3-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>