aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2020-08-06Merge branch 'v5.6/base' into v5.6/standard/tiny/arm-versatile-926ejsv5.6/standard/tiny/arm-versatile-926ejsBruce Ashfield
2020-08-06Merge branch 'v5.6/base' into v5.6/standard/baseBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-06-17x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisonedTony Luck
commit 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 upstream. An interesting thing happened when a guest Linux instance took a machine check. The VMM unmapped the bad page from guest physical space and passed the machine check to the guest. Linux took all the normal actions to offline the page from the process that was using it. But then guest Linux crashed because it said there was a second machine check inside the kernel with this stack trace: do_memory_failure set_mce_nospec set_memory_uc _set_memory_uc change_page_attr_set_clr cpa_flush clflush_cache_range_opt This was odd, because a CLFLUSH instruction shouldn't raise a machine check (it isn't consuming the data). Further investigation showed that the VMM had passed in another machine check because is appeared that the guest was accessing the bad page. Fix is to check the scope of the poison by checking the MCi_MISC register. If the entire page is affected, then unmap the page. If only part of the page is affected, then mark the page as uncacheable. This assumes that VMMs will do the logical thing and pass in the "whole page scope" via the MCi_MISC register (since they unmapped the entire page). [ bp: Adjust to x86/entry changes. ] Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()") Reported-by: Jue Wang <juew@google.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jue Wang <juew@google.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17x86/reboot/quirks: Add MacBook6,1 reboot quirkHill Ma
commit 140fd4ac78d385e6c8e6a5757585f6c707085f87 upstream. On MacBook6,1 reboot would hang unless parameter reboot=pci is added. Make it automatic. Signed-off-by: Hill Ma <maahiuzeon@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200425200641.GA1554@cslab.localdomain Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.Anthony Steinhauser
commit 4d8df8cbb9156b0a0ab3f802b80cb5db57acc0bf upstream. Currently, it is possible to enable indirect branch speculation even after it was force-disabled using the PR_SPEC_FORCE_DISABLE option. Moreover, the PR_GET_SPECULATION_CTRL command gives afterwards an incorrect result (force-disabled when it is in fact enabled). This also is inconsistent vs. STIBP and the documention which cleary states that PR_SPEC_FORCE_DISABLE cannot be undone. Fix this by actually enforcing force-disabled indirect branch speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails with -EPERM as described in the documentation. Fixes: 9137bb27e60e ("x86/speculation: Add prctl() control for indirect branch speculation") Signed-off-by: Anthony Steinhauser <asteinhauser@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.Anthony Steinhauser
commit 21998a351512eba4ed5969006f0c55882d995ada upstream. When STIBP is unavailable or enhanced IBRS is available, Linux force-disables the IBPB mitigation of Spectre-BTB even when simultaneous multithreading is disabled. While attempts to enable IBPB using prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...) equivalent) which are used e.g. by Chromium or OpenSSH succeed with no errors but the application remains silently vulnerable to cross-process Spectre v2 attacks (classical BTB poisoning). At the same time the SYSFS reporting (/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays that IBPB is conditionally enabled when in fact it is unconditionally disabled. STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP is unavailable, it makes no sense to force-disable also IBPB, because IBPB protects against cross-process Spectre-BTB attacks regardless of the SMT state. At the same time since missing STIBP was only observed on AMD CPUs, AMD does not recommend using STIBP, but recommends using IBPB, so disabling IBPB because of missing STIBP goes directly against AMD's advice: https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf Similarly, enhanced IBRS is designed to protect cross-core BTB poisoning and BTB-poisoning attacks from user space against kernel (and BTB-poisoning attacks from guest against hypervisor), it is not designed to prevent cross-process (or cross-VM) BTB poisoning between processes (or VMs) running on the same core. Therefore, even with enhanced IBRS it is necessary to flush the BTB during context-switches, so there is no reason to force disable IBPB when enhanced IBRS is available. Enable the prctl control of IBPB even when STIBP is unavailable or enhanced IBRS is available. Fixes: 7cc765a67d8e ("x86/speculation: Enable prctl mode for spectre_v2_user") Signed-off-by: Anthony Steinhauser <asteinhauser@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17x86/speculation: Prevent rogue cross-process SSBD shutdownAnthony Steinhauser
commit dbbe2ad02e9df26e372f38cc3e70dab9222c832e upstream. On context switch the change of TIF_SSBD and TIF_SPEC_IB are evaluated to adjust the mitigations accordingly. This is optimized to avoid the expensive MSR write if not needed. This optimization is buggy and allows an attacker to shutdown the SSBD protection of a victim process. The update logic reads the cached base value for the speculation control MSR which has neither the SSBD nor the STIBP bit set. It then OR's the SSBD bit only when TIF_SSBD is different and requests the MSR update. That means if TIF_SSBD of the previous and next task are the same, then the base value is not updated, even if TIF_SSBD is set. The MSR write is not requested. Subsequently if the TIF_STIBP bit differs then the STIBP bit is updated in the base value and the MSR is written with a wrong SSBD value. This was introduced when the per task/process conditional STIPB switching was added on top of the existing SSBD switching. It is exploitable if the attacker creates a process which enforces SSBD and has the contrary value of STIBP than the victim process (i.e. if the victim process enforces STIBP, the attacker process must not enforce it; if the victim process does not enforce STIBP, the attacker process must enforce it) and schedule it on the same core as the victim process. If the victim runs after the attacker the victim becomes vulnerable to Spectre V4. To fix this, update the MSR value independent of the TIF_SSBD difference and dependent on the SSBD mitigation method available. This ensures that a subsequent STIPB initiated MSR write has the correct state of SSBD. [ tglx: Handle X86_FEATURE_VIRT_SSBD & X86_FEATURE_VIRT_SSBD correctly and massaged changelog ] Fixes: 5bfbe3ad5840 ("x86/speculation: Prepare for per task indirect branch speculation control") Signed-off-by: Anthony Steinhauser <asteinhauser@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17x86_64: Fix jiffies ODR violationBob Haarman
commit d8ad6d39c35d2b44b3d48b787df7f3359381dcbf upstream. 'jiffies' and 'jiffies_64' are meant to alias (two different symbols that share the same address). Most architectures make the symbols alias to the same address via a linker script assignment in their arch/<arch>/kernel/vmlinux.lds.S: jiffies = jiffies_64; which is effectively a definition of jiffies. jiffies and jiffies_64 are both forward declared for all architectures in include/linux/jiffies.h. jiffies_64 is defined in kernel/time/timer.c. x86_64 was peculiar in that it wasn't doing the above linker script assignment, but rather was: 1. defining jiffies in arch/x86/kernel/time.c instead via the linker script. 2. overriding the symbol jiffies_64 from kernel/time/timer.c in arch/x86/kernel/vmlinux.lds.s via 'jiffies_64 = jiffies;'. As Fangrui notes: In LLD, symbol assignments in linker scripts override definitions in object files. GNU ld appears to have the same behavior. It would probably make sense for LLD to error "duplicate symbol" but GNU ld is unlikely to adopt for compatibility reasons. This results in an ODR violation (UB), which seems to have survived thus far. Where it becomes harmful is when; 1. -fno-semantic-interposition is used: As Fangrui notes: Clang after LLVM commit 5b22bcc2b70d ("[X86][ELF] Prefer to lower MC_GlobalAddress operands to .Lfoo$local") defaults to -fno-semantic-interposition similar semantics which help -fpic/-fPIC code avoid GOT/PLT when the referenced symbol is defined within the same translation unit. Unlike GCC -fno-semantic-interposition, Clang emits such relocations referencing local symbols for non-pic code as well. This causes references to jiffies to refer to '.Ljiffies$local' when jiffies is defined in the same translation unit. Likewise, references to jiffies_64 become references to '.Ljiffies_64$local' in translation units that define jiffies_64. Because these differ from the names used in the linker script, they will not be rewritten to alias one another. 2. Full LTO Full LTO effectively treats all source files as one translation unit, causing these local references to be produced everywhere. When the linker processes the linker script, there are no longer any references to jiffies_64' anywhere to replace with 'jiffies'. And thus '.Ljiffies$local' and '.Ljiffies_64$local' no longer alias at all. In the process of porting patches enabling Full LTO from arm64 to x86_64, spooky bugs have been observed where the kernel appeared to boot, but init doesn't get scheduled. Avoid the ODR violation by matching other architectures and define jiffies only by linker script. For -fno-semantic-interposition + Full LTO, there is no longer a global definition of jiffies for the compiler to produce a local symbol which the linker script won't ensure aliases to jiffies_64. Fixes: 40747ffa5aa8 ("asmlinkage: Make jiffies visible") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: Alistair Delva <adelva@google.com> Debugged-by: Nick Desaulniers <ndesaulniers@google.com> Debugged-by: Sami Tolvanen <samitolvanen@google.com> Suggested-by: Fangrui Song <maskray@google.com> Signed-off-by: Bob Haarman <inglorion@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # build+boot on Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: stable@vger.kernel.org Link: https://github.com/ClangBuiltLinux/linux/issues/852 Link: https://lkml.kernel.org/r/20200602193100.229287-1-inglorion@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-17x86/cpu/amd: Make erratum #1054 a legacy erratumKim Phillips
[ Upstream commit e2abfc0448a46d8a137505aa180caf14070ec535 ] Commit 21b5ee59ef18 ("x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF") mistakenly added erratum #1054 as an OS Visible Workaround (OSVW) ID 0. Erratum #1054 is not OSVW ID 0 [1], so make it a legacy erratum. There would never have been a false positive on older hardware that has OSVW bit 0 set, since the IRPERF feature was not available. However, save a couple of RDMSR executions per thread, on modern system configurations that correctly set non-zero values in their OSVW_ID_Length MSRs. [1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors. The revision guide is available from the bugzilla link below. Fixes: 21b5ee59ef18 ("x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200417143356.26054-1-kim.phillips@amd.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-10x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigationMark Gross
commit 7e5b3c267d256822407a22fdce6afdf9cd13f9fb upstream SRBDS is an MDS-like speculative side channel that can leak bits from the random number generator (RNG) across cores and threads. New microcode serializes the processor access during the execution of RDRAND and RDSEED. This ensures that the shared buffer is overwritten before it is released for reuse. While it is present on all affected CPU models, the microcode mitigation is not needed on models that enumerate ARCH_CAPABILITIES[MDS_NO] in the cases where TSX is not supported or has been disabled with TSX_CTRL. The mitigation is activated by default on affected processors and it increases latency for RDRAND and RDSEED instructions. Among other effects this will reduce throughput from /dev/urandom. * Enable administrator to configure the mitigation off when desired using either mitigations=off or srbds=off. * Export vulnerability status via sysfs * Rename file-scoped macros to apply for non-whitelist table initializations. [ bp: Massage, - s/VULNBL_INTEL_STEPPING/VULNBL_INTEL_STEPPINGS/g, - do not read arch cap MSR a second time in tsx_fused_off() - just pass it in, - flip check in cpu_set_bug_bits() to save an indentation level, - reflow comments. jpoimboe: s/Mitigated/Mitigation/ in user-visible strings tglx: Dropped the fused off magic for now ] Signed-off-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-10x86/cpu: Add 'table' argument to cpu_matches()Mark Gross
commit 93920f61c2ad7edb01e63323832585796af75fc9 upstream To make cpu_matches() reusable for other matching tables, have it take a pointer to a x86_cpu_id table as an argument. [ bp: Flip arguments order. ] Signed-off-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-10x86/cpu: Add a steppings field to struct x86_cpu_idMark Gross
commit e9d7144597b10ff13ff2264c059f7d4a7fbc89ac upstream Intel uses the same family/model for several CPUs. Sometimes the stepping must be checked to tell them apart. On x86 there can be at most 16 steppings. Add a steppings bitmask to x86_cpu_id and a X86_MATCH_VENDOR_FAMILY_MODEL_STEPPING_FEATURE macro and support for matching against family/model/stepping. [ bp: Massage. tglx: Lightweight variant for backporting ] Signed-off-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-03x86/ioperm: Prevent a memory leak when fork failsJay Lang
commit 4bfe6cce133cad82cea04490c308795275857782 upstream. In the copy_process() routine called by _do_fork(), failure to allocate a PID (or further along in the function) will trigger an invocation to exit_thread(). This is done to clean up from an earlier call to copy_thread_tls(). Naturally, the child task is passed into exit_thread(), however during the process, io_bitmap_exit() nullifies the parent's io_bitmap rather than the child's. As copy_thread_tls() has been called ahead of the failure, the reference count on the calling thread's io_bitmap is incremented as we would expect. However, io_bitmap_exit() doesn't accept any arguments, and thus assumes it should trash the current thread's io_bitmap reference rather than the child's. This is pretty sneaky in practice, because in all instances but this one, exit_thread() is called with respect to the current task and everything works out. A determined attacker can issue an appropriate ioctl (i.e. KDENABIO) to get a bitmap allocated, and force a clone3() syscall to fail by passing in a zeroed clone_args structure. The kernel handles the erroneous struct and the buggy code path is followed, and even though the parent's reference to the io_bitmap is trashed, the child still holds a reference and thus the structure will never be freed. Fix this by tweaking io_bitmap_exit() and its subroutines to accept a task_struct argument which to operate on. Fixes: ea5f1cd7ab49 ("x86/ioperm: Remove bitmap if all permissions dropped") Signed-off-by: Jay Lang <jaytlang@mit.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable#@vger.kernel.org Link: https://lkml.kernel.org/r/20200524162742.253727-1-jaytlang@mit.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-03copy_xstate_to_kernel(): don't leave parts of destination uninitializedAl Viro
commit 9e4636545933131de15e1ecd06733538ae939b2f upstream. copy the corresponding pieces of init_fpstate into the gaps instead. Cc: stable@kernel.org Tested-by: Alexander Potapenko <glider@google.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasksJosh Poimboeuf
commit 187b96db5ca79423618dfa29a05c438c34f9e1f0 upstream. Normally, show_trace_log_lvl() scans the stack, looking for text addresses to print. In parallel, it unwinds the stack with unwind_next_frame(). If the stack address matches the pointer returned by unwind_get_return_address_ptr() for the current frame, the text address is printed normally without a question mark. Otherwise it's considered a breadcrumb (potentially from a previous call path) and it's printed with a question mark to indicate that the address is unreliable and typically can be ignored. Since the following commit: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks") ... for inactive tasks, show_trace_log_lvl() prints *only* unreliable addresses (prepended with '?'). That happens because, for the first frame of an inactive task, unwind_get_return_address_ptr() returns the wrong return address pointer: one word *below* the task stack pointer. show_trace_log_lvl() starts scanning at the stack pointer itself, so it never finds the first 'reliable' address, causing only guesses to being printed. The first frame of an inactive task isn't a normal stack frame. It's actually just an instance of 'struct inactive_task_frame' which is left behind by __switch_to_asm(). Now that this inactive frame is actually exposed to callers, fix unwind_get_return_address_ptr() to interpret it properly. Fixes: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200522135435.vbxs7umku5pyrdbk@treble Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27x86/apic: Move TSC deadline timer debug printkThomas Gleixner
[ Upstream commit c84cb3735fd53c91101ccdb191f2e3331a9262cb ] Leon reported that the printk_once() in __setup_APIC_LVTT() triggers a lockdep splat due to a lock order violation between hrtimer_base::lock and console_sem, when the 'once' condition is reset via /sys/kernel/debug/clear_warn_once after boot. The initial printk cannot trigger this because that happens during boot when the local APIC timer is set up on the boot CPU. Prevent it by moving the printk to a place which is guaranteed to be only called once during boot. Mark the deadline timer check related functions and data __init while at it. Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87y2qhoshi.fsf@nanos.tec.linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20x86/unwind/orc: Fix error handling in __unwind_start()Josh Poimboeuf
commit 71c95825289f585014fe9741b051d32a7a916680 upstream. The unwind_state 'error' field is used to inform the reliable unwinding code that the stack trace can't be trusted. Set this field for all errors in __unwind_start(). Also, move the zeroing out of the unwind_state struct to before the ORC table initialization check, to prevent the caller from reading uninitialized data if the ORC table is corrupted. Fixes: af085d9084b4 ("stacktrace/x86: add function for detecting reliable stack traces") Fixes: d3a09104018c ("x86/unwinder/orc: Dont bail on stack overflow") Fixes: 98d0c8ebf77e ("x86/unwind/orc: Prevent unwinding before ORC initialization") Reported-by: Pavel Machek <pavel@denx.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/d6ac7215a84ca92b895fdd2e1aa546729417e6e6.1589487277.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20x86: Fix early boot crash on gcc-10, third tryBorislav Petkov
commit a9a3ed1eff3601b63aea4fb462d8b3b92c7c1e7e upstream. ... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value. The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10: Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013 Call Trace: dump_stack panic ? start_secondary __stack_chk_fail start_secondary secondary_startup_64 -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call. To fix that, the initial attempt was to mark the one function which generates the stack canary with: __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused) however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer and thus leading to not present frame pointer - frame pointer which the kernel needs. The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm(""). This current solution was short and sweet, and reportedly, is supported by both compilers but we didn't get very far this time: future (LTO?) optimization passes could potentially eliminate this, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around etc. That should hold for a long time, but hey we said that about the other two solutions too so... Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kalle Valo <kvalo@codeaurora.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot upSteven Rostedt (VMware)
[ Upstream commit 59566b0b622e3e6ea928c0b8cac8a5601b00b383 ] Booting one of my machines, it triggered the following crash: Kernel/User page tables isolation: enabled ftrace: allocating 36577 entries in 143 pages Starting tracer 'function' BUG: unable to handle page fault for address: ffffffffa000005c #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 RIP: 0010:text_poke_early+0x4a/0x58 Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff 0 41 57 49 RSP: 0000:ffffffff82003d38 EFLAGS: 00010046 RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005 RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0 R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0 R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0 FS: 0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0 Call Trace: text_poke_bp+0x27/0x64 ? mutex_lock+0x36/0x5d arch_ftrace_update_trampoline+0x287/0x2d5 ? ftrace_replace_code+0x14b/0x160 ? ftrace_update_ftrace_func+0x65/0x6c __register_ftrace_function+0x6d/0x81 ftrace_startup+0x23/0xc1 register_ftrace_function+0x20/0x37 func_set_flag+0x59/0x77 __set_tracer_option.isra.19+0x20/0x3e trace_set_options+0xd6/0x13e apply_trace_boot_options+0x44/0x6d register_tracer+0x19e/0x1ac early_trace_init+0x21b/0x2c9 start_kernel+0x241/0x518 ? load_ucode_intel_bsp+0x21/0x52 secondary_startup_64+0xa4/0xb0 I was able to trigger it on other machines, when I added to the kernel command line of both "ftrace=function" and "trace_options=func_stack_trace". The cause is the "ftrace=function" would register the function tracer and create a trampoline, and it will set it as executable and read-only. Then the "trace_options=func_stack_trace" would then update the same trampoline to include the stack tracer version of the function tracer. But since the trampoline already exists, it updates it with text_poke_bp(). The problem is that text_poke_bp() called while system_state == SYSTEM_BOOTING, it will simply do a memcpy() and not the page mapping, as it would think that the text is still read-write. But in this case it is not, and we take a fault and crash. Instead, lets keep the ftrace trampolines read-write during boot up, and then when the kernel executable text is set to read-only, the ftrace trampolines get set to read-only as well. Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: stable@vger.kernel.org Fixes: 768ae4406a5c ("x86/ftrace: Use text_poke()") Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-14x86/unwind/orc: Fix premature unwind stoppage due to IRET framesJosh Poimboeuf
commit 81b67439d147677d844d492fcbd03712ea438f42 upstream. The following execution path is possible: fsnotify() [ realign the stack and store previous SP in R10 ] <IRQ> [ only IRET regs saved ] common_interrupt() interrupt_entry() <NMI> [ full pt_regs saved ] ... [ unwind stack ] When the unwinder goes through the NMI and the IRQ on the stack, and then sees fsnotify(), it doesn't have access to the value of R10, because it only has the five IRET registers. So the unwind stops prematurely. However, because the interrupt_entry() code is careful not to clobber R10 before saving the full regs, the unwinder should be able to read R10 from the previously saved full pt_regs associated with the NMI. Handle this case properly. When encountering an IRET regs frame immediately after a full pt_regs frame, use the pt_regs as a backup which can be used to get the C register values. Also, note that a call frame resets the 'prev_regs' value, because a function is free to clobber the registers. For this fix to work, the IRET and full regs frames must be adjacent, with no FUNC frames in between. So replace the FUNC hint in interrupt_entry() with an IRET_REGS hint. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/97a408167cc09f1cfa0de31a7b70dd88868d743f.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/unwind/orc: Fix error path for bad ORC entry typeJosh Poimboeuf
commit a0f81bf26888048100bf017fadf438a5bdffa8d8 upstream. If the ORC entry type is unknown, nothing else can be done other than reporting an error. Exit the function instead of breaking out of the switch statement. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/a7fa668ca6eabbe81ab18b2424f15adbbfdc810a.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/unwind/orc: Prevent unwinding before ORC initializationJosh Poimboeuf
commit 98d0c8ebf77e0ba7c54a9ae05ea588f0e9e3f46e upstream. If the unwinder is called before the ORC data has been initialized, orc_find() returns NULL, and it tries to fall back to using frame pointers. This can cause some unexpected warnings during boot. Move the 'orc_init' check from orc_find() to __unwind_init(), so that it doesn't even try to unwind from an uninitialized state. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/069d1499ad606d85532eb32ce39b2441679667d5.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-14x86/unwind/orc: Don't skip the first frame for inactive tasksMiroslav Benes
commit f1d9a2abff66aa8156fbc1493abed468db63ea48 upstream. When unwinding an inactive task, the ORC unwinder skips the first frame by default. If both the 'regs' and 'first_frame' parameters of unwind_start() are NULL, 'state->sp' and 'first_frame' are later initialized to the same value for an inactive task. Given there is a "less than or equal to" comparison used at the end of __unwind_start() for skipping stack frames, the first frame is skipped. Drop the equal part of the comparison and make the behavior equivalent to the frame pointer unwinder. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Jones <dsj@fb.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lore.kernel.org/r/7f08db872ab59e807016910acdbe82f744de7065.1587808742.git.jpoimboe@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02x86: hyperv: report value of misc_featuresOlaf Hering
[ Upstream commit 97d9f1c43bedd400301d6f1eff54d46e8c636e47 ] A few kernel features depend on ms_hyperv.misc_features, but unlike its siblings ->features and ->hints, the value was never reported during boot. Signed-off-by: Olaf Hering <olaf@aepfle.de> Link: https://lore.kernel.org/r/20200407172739.31371-1-olaf@aepfle.de Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23x86: ACPI: fix CPU hotplug deadlockQian Cai
[ Upstream commit 696ac2e3bf267f5a2b2ed7d34e64131f2287d0ad ] Similar to commit 0266d81e9bf5 ("acpi/processor: Prevent cpu hotplug deadlock") except this is for acpi_processor_ffh_cstate_probe(): "The problem is that the work is scheduled on the current CPU from the hotplug thread associated with that CPU. It's not required to invoke these functions via the workqueue because the hotplug thread runs on the target CPU already. Check whether current is a per cpu thread pinned on the target CPU and invoke the function directly to avoid the workqueue." WARNING: possible circular locking dependency detected ------------------------------------------------------ cpuhp/1/15 is trying to acquire lock: ffffc90003447a28 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: __flush_work+0x4c6/0x630 but task is already holding lock: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x3e/0xc0 irq_calc_affinity_vectors+0x5f/0x91 __pci_enable_msix_range+0x10f/0x9a0 pci_alloc_irq_vectors_affinity+0x13e/0x1f0 pci_alloc_irq_vectors_affinity at drivers/pci/msi.c:1208 pqi_ctrl_init+0x72f/0x1618 [smartpqi] pqi_pci_probe.cold.63+0x882/0x892 [smartpqi] local_pci_probe+0x7a/0xc0 work_for_cpu_fn+0x2e/0x50 process_one_work+0x57e/0xb90 worker_thread+0x363/0x5b0 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 -> #0 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __lock_acquire+0x2244/0x32a0 lock_acquire+0x1a2/0x680 __flush_work+0x4e6/0x630 work_on_cpu+0x114/0x160 acpi_processor_ffh_cstate_probe+0x129/0x250 acpi_processor_evaluate_cst+0x4c8/0x580 acpi_processor_get_power_info+0x86/0x740 acpi_processor_hotplug+0xc3/0x140 acpi_soft_cpu_online+0x102/0x1d0 cpuhp_invoke_callback+0x197/0x1120 cpuhp_thread_fun+0x252/0x2f0 smpboot_thread_fn+0x255/0x440 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 other info that might help us debug this: Chain exists of: (work_completion)(&wfc.work) --> cpuhp_state-up --> cpuidle_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpuidle_lock); lock(cpuhp_state-up); lock(cpuidle_lock); lock((work_completion)(&wfc.work)); *** DEADLOCK *** 3 locks held by cpuhp/1/15: #0: ffffffffaf51ab10 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0 #1: ffffffffaf51ad40 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x69/0x2f0 #2: ffffffffafa1c0e8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x17/0x20 Call Trace: dump_stack+0xa0/0xea print_circular_bug.cold.52+0x147/0x14c check_noncircular+0x295/0x2d0 __lock_acquire+0x2244/0x32a0 lock_acquire+0x1a2/0x680 __flush_work+0x4e6/0x630 work_on_cpu+0x114/0x160 acpi_processor_ffh_cstate_probe+0x129/0x250 acpi_processor_evaluate_cst+0x4c8/0x580 acpi_processor_get_power_info+0x86/0x740 acpi_processor_hotplug+0xc3/0x140 acpi_soft_cpu_online+0x102/0x1d0 cpuhp_invoke_callback+0x197/0x1120 cpuhp_thread_fun+0x252/0x2f0 smpboot_thread_fn+0x255/0x440 kthread+0x1f4/0x220 ret_from_fork+0x27/0x50 Signed-off-by: Qian Cai <cai@lca.pw> Tested-by: Borislav Petkov <bp@suse.de> [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23x86/Hyper-V: Report crash register data or kmsg before running crash kernelTianyu Lan
commit a11589563e96bf262767294b89b25a9d44e7303b upstream. We want to notify Hyper-V when a Linux guest VM crash occurs, so there is a record of the crash even when kdump is enabled. But crash_kexec_post_notifiers defaults to "false", so the kdump kernel runs before the notifiers and Hyper-V never gets notified. Fix this by always setting crash_kexec_post_notifiers to be true for Hyper-V VMs. Fixes: 81b18bce48af ("Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic") Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Link: https://lore.kernel.org/r/20200406155331.2105-5-Tianyu.Lan@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21x86/resctrl: Fix invalid attempt at removing the default resource groupReinette Chatre
commit b0151da52a6d4f3951ea24c083e7a95977621436 upstream. The default resource group ("rdtgroup_default") is associated with the root of the resctrl filesystem and should never be removed. New resource groups can be created as subdirectories of the resctrl filesystem and they can be removed from user space. There exists a safeguard in the directory removal code (rdtgroup_rmdir()) that ensures that only subdirectories can be removed by testing that the directory to be removed has to be a child of the root directory. A possible deadlock was recently fixed with 334b0f4e9b1b ("x86/resctrl: Fix a deadlock due to inaccurate reference"). This fix involved associating the private data of the "mon_groups" and "mon_data" directories to the resource group to which they belong instead of NULL as before. A consequence of this change was that the original safeguard code preventing removal of "mon_groups" and "mon_data" found in the root directory failed resulting in attempts to remove the default resource group that ends in a BUG: kernel BUG at mm/slub.c:3969! invalid opcode: 0000 [#1] SMP PTI Call Trace: rdtgroup_rmdir+0x16b/0x2c0 kernfs_iop_rmdir+0x5c/0x90 vfs_rmdir+0x7a/0x160 do_rmdir+0x17d/0x1e0 do_syscall_64+0x55/0x1d0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by improving the directory removal safeguard to ensure that subdirectories of the resctrl root directory can only be removed if they are a child of the resctrl filesystem's root _and_ not associated with the default resource group. Fixes: 334b0f4e9b1b ("x86/resctrl: Fix a deadlock due to inaccurate reference") Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/884cbe1773496b5dbec1b6bd11bb50cffa83603d.1584461853.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21x86/resctrl: Preserve CDP enable over CPU hotplugJames Morse
commit 9fe0450785abbc04b0ed5d3cf61fcdb8ab656b4b upstream. Resctrl assumes that all CPUs are online when the filesystem is mounted, and that CPUs remember their CDP-enabled state over CPU hotplug. This goes wrong when resctrl's CDP-enabled state changes while all the CPUs in a domain are offline. When a domain comes online, enable (or disable!) CDP to match resctrl's current setting. Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support") Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200221162105.154163-1-james.morse@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17x86/tsc_msr: Make MSR derived TSC frequency more accurateHans de Goede
commit fac01d11722c92a186b27ee26cd429a8066adfb5 upstream. The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 4: Model-Specific Registers" has the following table for the values from freq_desc_byt: 000B: 083.3 MHz 001B: 100.0 MHz 010B: 133.3 MHz 011B: 116.7 MHz 100B: 080.0 MHz Notice how for e.g the 83.3 MHz value there are 3 significant digits, which translates to an accuracy of a 1000 ppm, where as a typical crystal oscillator is 20 - 100 ppm, so the accuracy of the frequency format used in the Software Developer’s Manual is not really helpful. As far as we know Bay Trail SoCs use a 25 MHz crystal and Cherry Trail uses a 19.2 MHz crystal, the crystal is the source clock for a root PLL which outputs 1600 and 100 MHz. It is unclear if the root PLL outputs are used directly by the CPU clock PLL or if there is another PLL in between. This does not matter though, we can model the chain of PLLs as a single PLL with a quotient equal to the quotients of all PLLs in the chain multiplied. So we can create a simplified model of the CPU clock setup using a reference clock of 100 MHz plus a quotient which gets us as close to the frequency from the SDM as possible. For the 83.3 MHz example from above this would give 100 MHz * 5 / 6 = 83 and 1/3 MHz, which matches exactly what has been measured on actual hardware. Use a simplified PLL model with a reference clock of 100 MHz for all Bay and Cherry Trail models. This has been tested on the following models: CPU freq before: CPU freq after: Intel N2840 2165.800 MHz 2166.667 MHz Intel Z3736 1332.800 MHz 1333.333 MHz Intel Z3775 1466.300 MHz 1466.667 MHz Intel Z8350 1440.000 MHz 1440.000 MHz Intel Z8750 1600.000 MHz 1600.000 MHz This fixes the time drifting by about 1 second per hour (20 - 30 seconds per day) on (some) devices which rely on the tsc_msr.c code to determine the TSC frequency. Reported-by: Vipul Kumar <vipulk0511@gmail.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200223140610.59612-3-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17x86/tsc_msr: Fix MSR_FSB_FREQ mask for Cherry Trail devicesHans de Goede
commit c8810e2ffc30c7e1577f9c057c4b85d984bbc35a upstream. According to the "Intel 64 and IA-32 Architectures Software Developer's Manual Volume 4: Model-Specific Registers" on Cherry Trail (Airmont) devices the 4 lowest bits of the MSR_FSB_FREQ mask indicate the bus freq unlike on e.g. Bay Trail where only the lowest 3 bits are used. This is also the reason why MAX_NUM_FREQS is defined as 9, since Cherry Trail SoCs have 9 possible frequencies, so the lo value from the MSR needs to be masked with 0x0f, not with 0x07 otherwise the 9th frequency will get interpreted as the 1st. Bump MAX_NUM_FREQS to 16 to avoid any possibility of addressing the array out of bounds and makes the mask part of the cpufreq struct so it can be set it per model. While at it also log an error when the index points to an uninitialized part of the freqs lookup-table. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200223140610.59612-2-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17x86/tsc_msr: Use named struct initializersHans de Goede
commit 812c2d7506fde7cdf83cb2532810a65782b51741 upstream. Use named struct initializers for the freq_desc struct-s initialization and change the "u8 msr_plat" to a "bool use_msr_plat" to make its meaning more clear instead of relying on a comment to explain it. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200223140610.59612-1-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17acpi/x86: ignore unspecified bit positions in the ACPI global lock fieldJan Engelhardt
commit ecb9c790999fd6c5af0f44783bd0217f0b89ec2b upstream. The value in "new" is constructed from "old" such that all bits defined as reserved by the ACPI spec[1] are left untouched. But if those bits do not happen to be all zero, "new < 3" will not evaluate to true. The firmware of the laptop(s) Medion MD63490 / Akoya P15648 comes with garbage inside the "FACS" ACPI table. The starting value is old=0x4944454d, therefore new=0x4944454e, which is >= 3. Mask off the reserved bits. [1] https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf Link: https://bugzilla.kernel.org/show_bug.cgi?id=206553 Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-16Merge tag 'v5.6-rc6' into standard/baseBruce Ashfield
Linux 5.6-rc6
2020-03-15Merge tag 'x86-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Two fixes for x86: - Map EFI runtime service data as encrypted when SEV is enabled. Otherwise e.g. SMBIOS data cannot be properly decoded by dmidecode. - Remove the warning in the vector management code which triggered when a managed interrupt affinity changed outside of a CPU hotplug operation. The warning was correct until the recent core code change that introduced a CPU isolation feature which needs to migrate managed interrupts away from online CPUs under certain conditions to achieve the isolation" * tag 'x86-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vector: Remove warning on managed interrupt migration x86/ioremap: Map EFI runtime services data as encrypted for SEV
2020-03-15Merge tag 'ras-urgent-2020-03-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fixes from Thomas Gleixner: "Two RAS related fixes: - Shut down the per CPU thermal throttling poll work properly when a CPU goes offline. The missing shutdown caused the poll work to be migrated to a unbound worker which triggered warnings about the usage of smp_processor_id() in preemptible context - Fix the PPIN feature initialization which missed to enable the functionality when PPIN_CTL was enabled but the MSR locked against updates" * tag 'ras-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Fix logic and comments around MSR_PPIN_CTL x86/mce/therm_throt: Undo thermal polling properly on CPU offline
2020-03-13x86/vector: Remove warning on managed interrupt migrationPeter Xu
The vector management code assumes that managed interrupts cannot be migrated away from an online CPU. free_moved_vector() has a WARN_ON_ONCE() which triggers when a managed interrupt vector association on a online CPU is cleared. The CPU offline code uses a different mechanism which cannot trigger this. This assumption is not longer correct because the new CPU isolation feature which affects the placement of managed interrupts must be able to move a managed interrupt away from an online CPU. There are two reasons why this can happen: 1) When the interrupt is activated the affinity mask which was established in irq_create_affinity_masks() is handed in to the vector allocation code. This mask contains all CPUs to which the interrupt can be made affine to, but this does not take the CPU isolation 'managed_irq' mask into account. When the interrupt is finally requested by the device driver then the affinity is checked again and the CPU isolation 'managed_irq' mask is taken into account, which moves the interrupt to a non-isolated CPU if possible. 2) The interrupt can be affine to an isolated CPU because the non-isolated CPUs in the calculated affinity mask are not online. Once a non-isolated CPU which is in the mask comes online the interrupt is migrated to this non-isolated CPU In both cases the regular online migration mechanism is used which triggers the WARN_ON_ONCE() in free_moved_vector(). Case #1 could have been addressed by taking the isolation mask into account, but that would require a massive code change in the activation logic and the eventual migration event was accepted as a reasonable tradeoff when the isolation feature was developed. But even if #1 would be addressed, #2 would still trigger it. Of course the warning in free_moved_vector() was overlooked at that time and the above two cases which have been discussed during patch review have obviously never been tested before the final submission. So keep it simple and remove the warning. [ tglx: Rewrote changelog and added a comment to free_moved_vector() ] Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts") Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lkml.kernel.org/r/20200312205830.81796-1-peterx@redhat.com
2020-03-11x86/mce: Add compat_ioctl assignment to make it compatible with 32-bit systemHe Zhe
32-bit user-space program would get errors like the following from ioctl syscall due to missing compat_ioctl. MCE_GET_RECORD_LEN: Inappropriate ioctl for device compat_ptr_ioctl is provided as a generic implementation of .compat_ioctl file operation to ioctl functions that either ignore the argument or pass a pointer to a compatible data type. Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-03-02Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes: a pkeys fix for a bug that triggers with weird BIOS settings, and two Xen PV fixes: a paravirt interface fix, and pagetable dumping fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Fix dump_pagetables with Xen PV x86/ioperm: Add new paravirt function update_io_bitmap() x86/pkeys: Manually set X86_FEATURE_OSPKE to preserve existing changes
2020-03-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "More bugfixes, including a few remaining "make W=1" issues such as too large frame sizes on some configurations. On the ARM side, the compiler was messing up shadow stacks between EL1 and EL2 code, which is easily fixed with __always_inline" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: check descriptor table exits on instruction emulation kvm: x86: Limit the number of "kvm: disabled by bios" messages KVM: x86: avoid useless copy of cpufreq policy KVM: allow disabling -Werror KVM: x86: allow compiling as non-module with W=1 KVM: Pre-allocate 1 cpumask variable per cpu for both pv tlb and pv ipis KVM: Introduce pv check helpers KVM: let declaration of kvm_get_running_vcpus match implementation KVM: SVM: allocate AVIC data structures based on kvm_amd module parameter arm64: Ask the compiler to __always_inline functions used by KVM at HYP KVM: arm64: Define our own swab32() to avoid a uapi static inline KVM: arm64: Ask the compiler to __always_inline functions used at HYP kvm: arm/arm64: Fold VHE entry/exit work into kvm_vcpu_run_vhe() KVM: arm/arm64: Fix up includes for trace.h
2020-02-29x86/ioperm: Add new paravirt function update_io_bitmap()Juergen Gross
Commit 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() as well") reworked the iopl syscall to use I/O bitmaps. Unfortunately this broke Xen PV domains using that syscall as there is currently no I/O bitmap support in PV domains. Add I/O bitmap support via a new paravirt function update_io_bitmap which Xen PV domains can use to update their I/O bitmaps via a hypercall. Fixes: 111e7b15cf10f6 ("x86/ioperm: Extend IOPL config to control ioperm() as well") Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Cc: <stable@vger.kernel.org> # 5.5 Link: https://lkml.kernel.org/r/20200218154712.25490-1-jgross@suse.com
2020-02-28KVM: Pre-allocate 1 cpumask variable per cpu for both pv tlb and pv ipisWanpeng Li
Nick Desaulniers Reported: When building with: $ make CC=clang arch/x86/ CFLAGS=-Wframe-larger-than=1000 The following warning is observed: arch/x86/kernel/kvm.c:494:13: warning: stack frame size of 1064 bytes in function 'kvm_send_ipi_mask_allbutself' [-Wframe-larger-than=] static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int vector) ^ Debugging with: https://github.com/ClangBuiltLinux/frame-larger-than via: $ python3 frame_larger_than.py arch/x86/kernel/kvm.o \ kvm_send_ipi_mask_allbutself points to the stack allocated `struct cpumask newmask` in `kvm_send_ipi_mask_allbutself`. The size of a `struct cpumask` is potentially large, as it's CONFIG_NR_CPUS divided by BITS_PER_LONG for the target architecture. CONFIG_NR_CPUS for X86_64 can be as high as 8192, making a single instance of a `struct cpumask` 1024 B. This patch fixes it by pre-allocate 1 cpumask variable per cpu and use it for both pv tlb and pv ipis.. Reported-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-28KVM: Introduce pv check helpersWanpeng Li
Introduce some pv check helpers for consistency. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-27x86/mce: Fix logic and comments around MSR_PPIN_CTLTony Luck
There are two implemented bits in the PPIN_CTL MSR: Bit 0: LockOut (R/WO) Set 1 to prevent further writes to MSR_PPIN_CTL. Bit 1: Enable_PPIN (R/W) If 1, enables MSR_PPIN to be accessible using RDMSR. If 0, an attempt to read MSR_PPIN will cause #GP. So there are four defined values: 0: PPIN is disabled, PPIN_CTL may be updated 1: PPIN is disabled. PPIN_CTL is locked against updates 2: PPIN is enabled. PPIN_CTL may be updated 3: PPIN is enabled. PPIN_CTL is locked against updates Code would only enable the X86_FEATURE_INTEL_PPIN feature for case "2". When it should have done so for both case "2" and case "3". Fix the final test to just check for the enable bit. Also fix some of the other comments in this function. Fixes: 3f5a7896a509 ("x86/mce: Include the PPIN in MCE records when available") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200226011737.9958-1-tony.luck@intel.com
2020-02-27x86/pkeys: Manually set X86_FEATURE_OSPKE to preserve existing changesSean Christopherson
Explicitly set X86_FEATURE_OSPKE via set_cpu_cap() instead of calling get_cpu_cap() to pull the feature bit from CPUID after enabling CR4.PKE. Invoking get_cpu_cap() effectively wipes out any {set,clear}_cpu_cap() changes that were made between this_cpu->c_init() and setup_pku(), as all non-synthetic feature words are reinitialized from the CPU's CPUID values. Blasting away capability updates manifests most visibility when running on a VMX capable CPU, but with VMX disabled by BIOS. To indicate that VMX is disabled, init_ia32_feat_ctl() clears X86_FEATURE_VMX, using clear_cpu_cap() instead of setup_clear_cpu_cap() so that KVM can report which CPU is misconfigured (KVM needs to probe every CPU anyways). Restoring X86_FEATURE_VMX from CPUID causes KVM to think VMX is enabled, ultimately leading to an unexpected #GP when KVM attempts to do VMXON. Arguably, init_ia32_feat_ctl() should use setup_clear_cpu_cap() and let KVM figure out a different way to report the misconfigured CPU, but VMX is not the only feature bit that is affected, i.e. there is precedent that tweaking feature bits via {set,clear}_cpu_cap() after ->c_init() is expected to work. Most notably, x86_init_rdrand()'s clearing of X86_FEATURE_RDRAND when RDRAND malfunctions is also overwritten. Fixes: 0697694564c8 ("x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU") Reported-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Jacob Keller <jacob.e.keller@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200226231615.13664-1-sean.j.christopherson@intel.com
2020-02-25x86/mce/therm_throt: Undo thermal polling properly on CPU offlineThomas Gleixner
Chris Wilson reported splats from running the thermal throttling workqueue callback on offlined CPUs. The problem is that that callback should not even run on offlined CPUs but it happens nevertheless because the offlining callback thermal_throttle_offline() does not symmetrically undo the setup work done in its onlining counterpart. IOW, 1. The thermal interrupt vector should be masked out before ... 2. ... cancelling any pending work synchronously so that no new work is enqueued anymore. Do those things and fix the issue properly. [ bp: Write commit message. ] Fixes: f6656208f04e ("x86/mce/therm_throt: Optimize notifications of thermal throttle") Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/158120068234.18291.7938335950259651295@skylake-alporthouse-com
2020-02-22Merge tag 'ras-urgent-2020-02-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fixes from Thomas Gleixner: "Two fixes for the AMD MCE driver: - Populate the per CPU MCA bank descriptor pointer only after it has been completely set up to prevent a use-after-free in case that one of the subsequent initialization step fails - Implement a proper release function for the sysfs entries of MCA threshold controls instead of freeing the memory right in the CPU teardown code, which leads to another use-after-free when the associated sysfs file is opened and accessed" * tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce/amd: Fix kobject lifetime x86/mce/amd: Publish the bank pointer only after setup has succeeded
2020-02-22Merge tag 'x86-urgent-2020-02-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Two fixes for x86: - Remove the __force_oder definiton from the kaslr boot code as it is already defined in the page table code which makes GCC 10 builds fail because it changed the default to -fno-common. - Address the AMD erratum 1054 concerning the IRPERF capability and enable the Instructions Retired fixed counter on machines which are not affected by the erratum" * tag 'x86-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF x86/boot/compressed: Don't declare __force_order in kaslr_64.c
2020-02-19x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERFKim Phillips
Commit aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter") added support for access to the free-running counter via 'perf -e msr/irperf/', but when exercised, it always returns a 0 count: BEFORE: $ perf stat -e instructions,msr/irperf/ true Performance counter stats for 'true': 624,833 instructions 0 msr/irperf/ Simply set its enable bit - HWCR bit 30 - to make it start counting. Enablement is restricted to all machines advertising IRPERF capability, except those susceptible to an erratum that makes the IRPERF return bad values. That erratum occurs in Family 17h models 00-1fh [1], but not in F17h models 20h and above [2]. AFTER (on a family 17h model 31h machine): $ perf stat -e instructions,msr/irperf/ true Performance counter stats for 'true': 621,690 instructions 622,490 msr/irperf/ [1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors [2] Revision Guide for AMD Family 17h Models 30h-3Fh Processors The revision guides are available from the bugzilla Link below. [ bp: Massage commit message. ] Fixes: aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com
2020-02-14x86/mce/amd: Fix kobject lifetimeThomas Gleixner
Accessing the MCA thresholding controls in sysfs concurrently with CPU hotplug can lead to a couple of KASAN-reported issues: BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180 Read of size 8 at addr ffff888367578940 by task grep/4019 and BUG: KASAN: use-after-free in show_error_count+0x15c/0x180 Read of size 2 at addr ffff888368a05514 by task grep/4454 for example. Both result from the fact that the threshold block creation/teardown code frees the descriptor memory itself instead of defining proper ->release function and leaving it to the driver core to take care of that, after all sysfs accesses have completed. Do that and get rid of the custom freeing code, fixing the above UAFs in the process. [ bp: write commit message. ] Fixes: 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de
2020-02-13x86/mce/amd: Publish the bank pointer only after setup has succeededBorislav Petkov
threshold_create_bank() creates a bank descriptor per MCA error thresholding counter which can be controlled over sysfs. It publishes the pointer to that bank in a per-CPU variable and then goes on to create additional thresholding blocks if the bank has such. However, that creation of additional blocks in allocate_threshold_blocks() can fail, leading to a use-after-free through the per-CPU pointer. Therefore, publish that pointer only after all blocks have been setup successfully. Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor") Reported-by: Saar Amar <Saar.Amar@microsoft.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain