aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/include/asm
AgeCommit message (Collapse)Author
2020-01-17Revert "x86: Use CONFIG_PREEMPTION"He Zhe
This reverts commit 41bbdde13b435400816397a47e84fa697934c253. The reverted patch causes the following build failure. tmp-glibc/work-shared/intel-x86-64/kernel-source/include/linux/preempt.h:190:3: error: implicit declaration of function '__preempt_schedule' [-Werror=implicit-function-declaration] The reverted patch should have been used with b8d3349803ba ("sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y") which fixes a50a3f4b6a31 ("sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT") which is not included in 5.2.29. So we simply revert 41bbdde13b43. Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2020-01-09x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make ↵Ingo Molnar
the CPU_ENTRY_AREA_PAGES assert precise commit 05b042a1944322844eaae7ea596d5f154166d68a upstream. When two recent commits that increased the size of the 'struct cpu_entry_area' were merged in -tip, the 32-bit defconfig build started failing on the following build time assert: ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’ In function ‘setup_cpu_entry_area_ptes’, Which corresponds to the following build time assert: BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); The purpose of this assert is to sanity check the fixed-value definition of CPU_ENTRY_AREA_PAGES arch/x86/include/asm/pgtable_32_types.h: #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 41) The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value we didn't want to define in such a low level header, because it would cause dependency hell. Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES accordingly - and this assert is checking that constraint. But the assert is both imprecise and buggy, primarily because it doesn't include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE (which begins at a PMD boundary). This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined too large upstream (v5.4-rc8): #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 40) While 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra pages, which hid the bug. The following commit (not yet upstream) increased the size to 40 pages: x86/iopl: ("Restrict iopl() permission scope") ... but increased CPU_ENTRY_AREA_PAGES only 41 - i.e. shortening the gap to just 1 extra page. Then another not-yet-upstream commit changed the size again: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Which increased the cpu_entry_area size from 38 to 39 pages, but didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked fine, because we still had a page left from the accidental 'reserve'. But when these two commits were merged into the same tree, the combined size of cpu_entry_area grew from 38 to 40 pages, while CPU_ENTRY_AREA_PAGES finally caught up to 40 as well. Which is fine in terms of functionality, but the assert broke: BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area, which is 1 page larger due to the IDT page. To fix all this, change the assert to two precise asserts: BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE); BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE); This takes the IDT page into account, and also connects the size-based define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based define of CPU_ENTRY_AREA_MAP_SIZE. Also clean up some of the names which made it rather confusing: - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of the cpu-entry-area, but the per-cpu array size, so rename this to CPU_ENTRY_AREA_ARRAY_SIZE. - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping size, with the IDT included. - Add comments where '+1' denotes the IDT mapping - it wasn't obvious and took me about 3 hours to decode... Finally, because this particular commit is actually applied after this patch: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages. All future commits that change cpu_entry_area will have to adjust this value precisely. As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES and derive its value directly from the structure, without causing header hell - but that is an adventure for another day! :-) Fixes: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/cpu_entry_area: Add guard page for entry stack on 32bitThomas Gleixner
commit 880a98c339961eaa074393e3a2117cbe9125b8bb upstream. The entry stack in the cpu entry area is protected against overflow by the readonly GDT on 64-bit, but on 32-bit the GDT needs to be writeable and therefore does not trigger a fault on stack overflow. Add a guard page. Fixes: c482feefe1ae ("x86/entry/64: Make cpu_entry_area.tss read-only") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/stackframe/32: Repair 32-bit Xen PVJan Beulich
commit 81ff2c37f9e5d77593928df0536d86443195fd64 upstream. Once again RPL checks have been introduced which don't account for a 32-bit kernel living in ring 1 when running in a PV Xen domain. The case in FIXUP_FRAME has been preventing boot. Adjust BUG_IF_WRONG_CR3 as well to guard against future uses of the macro on a code path reachable when running in PV mode under Xen; I have to admit that I stopped at a certain point trying to figure out whether there are present ones. Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stable Team <stable@vger.kernel.org> Link: https://lore.kernel.org/r/0fad341f-b7f5-f859-d55d-f0084ee7087e@suse.com Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86: Use CONFIG_PREEMPTIONThomas Gleixner
commit 48593975aeee548f25e256c515fd1d1e3fb2cc20 upstream. CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same functionality which today depends on CONFIG_PREEMPT. Switch the entry code, preempt and kprobes conditionals over to CONFIG_PREEMPTION. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20190726212124.608488448@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/mm, tracing: Fix CR2 corruptionPeter Zijlstra
commit a0d14b8909de55139b8702fe0c7e80b69763dcfb upstream. Despite the current efforts to read CR2 before tracing happens there still exist a number of possible holes: idtentry page_fault do_page_fault has_error_code=1 call error_entry TRACE_IRQS_OFF call trace_hardirqs_off* #PF // modifies CR2 CALL_enter_from_user_mode __context_tracking_exit() trace_user_exit(0) #PF // modifies CR2 call do_page_fault address = read_cr2(); /* whoopsie */ And similar for i386. Fix it by pulling the CR2 read into the entry code, before any of that stuff gets a chance to run and ruin things. Reported-by: He Zhe <zhe.he@windriver.com> Reported-by: Eiichi Tsukata <devel@etsukata.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: bp@alien8.de Cc: rostedt@goodmis.org Cc: torvalds@linux-foundation.org Cc: hpa@zytor.com Cc: dave.hansen@linux.intel.com Cc: jgross@suse.com Cc: joel@joelfernandes.org Link: https://lkml.kernel.org/r/20190711114336.116812491@infradead.org Debugged-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/paravirt: Make read_cr2() CALLEE_SAVEPeter Zijlstra
commit 55aedddb6149ab71bec9f050846855113977b033 upstream. The one paravirt read_cr2() implementation (Xen) is actually quite trivial and doesn't need to clobber anything other than the return register. Making read_cr2() CALLEE_SAVE avoids all the PUSH/POP nonsense and allows more convenient use from assembly. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: bp@alien8.de Cc: rostedt@goodmis.org Cc: luto@kernel.org Cc: torvalds@linux-foundation.org Cc: hpa@zytor.com Cc: dave.hansen@linux.intel.com Cc: zhe.he@windriver.com Cc: joel@joelfernandes.org Cc: devel@etsukata.com Link: https://lkml.kernel.org/r/20190711114335.887392493@infradead.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/asm: Move native_write_cr0/4() out of lineThomas Gleixner
commit 7652ac92018536eb807b6c2130100c85f1ba7e3b upstream. The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n. The reason is that the static key which controls the pinning is marked RO after init. The kvm_intel module contains a CR4 write which requires to update the static key entry list. That obviously does not work when the key is in a RO section. With CONFIG_PARAVIRT enabled this does not happen because the CR4 write uses the paravirt indirection and the actual write function is built in. As the key is intended to be immutable after init, move native_write_cr0/4() out of line. While at it consolidate the update of the cr4 shadow variable and store the value right away when the pinning is initialized on a booting CPU. No point in reading it back 20 instructions later. This allows to confine the static key and the pinning variable to cpu/common and allows to mark them static. Fixes: 8dbec27a242c ("x86/asm: Pin sensitive CR0 bits") Fixes: 873d50d58f67 ("x86/asm: Pin sensitive CR4 bits") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Xi Ruoyao <xry111@mengyan1223.wang> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Xi Ruoyao <xry111@mengyan1223.wang> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907102140340.1758@nanos.tec.linutronix.de Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/pgtable/32: Fix LOWMEM_PAGES constantArnd Bergmann
commit 26515699863d68058e290e18e83f444925920be5 upstream. clang points out that the computation of LOWMEM_PAGES causes a signed integer overflow on 32-bit x86: arch/x86/kernel/head32.c:83:20: error: signed shift result (0x100000000) requires 34 bits to represent, but 'int' only has 32 bits [-Werror,-Wshift-overflow] (PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT); ^~~~~~~~~~~~ arch/x86/include/asm/pgtable_32.h:109:27: note: expanded from macro 'LOWMEM_PAGES' #define LOWMEM_PAGES ((((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT)) ~^ ~~ arch/x86/include/asm/pgtable_32.h:98:34: note: expanded from macro 'PAGE_TABLE_SIZE' #define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD) Use the _ULL() macro to make it a 64-bit constant. Fixes: 1e620f9b23e5 ("x86/boot/32: Convert the 32-bit pgtable setup code from assembly to C") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190710130522.1802800-1-arnd@arndb.de Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/ldt: Initialize the context lock for init_mmSebastian Andrzej Siewior
commit 39ca5fb4920a96eeab478be2cfa6a2369fef6b02 upstream. The mutex mm->context->lock for init_mm is not initialized for init_mm. This wasn't a problem because it remained unused. This changed however since commit 4fc19708b165c ("x86/alternatives: Initialize temporary mm for patching") Initialize the mutex for init_mm. Fixes: 4fc19708b165c ("x86/alternatives: Initialize temporary mm for patching") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20190701173354.2pe62hhliok2afea@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/stackframe/32: Allow int3_emulate_push()Peter Zijlstra
commit faeedb0679bee39ebffc6d53111e86932dea189a upstream. Now that x86_32 has an unconditional gap on the kernel stack frame, the int3_emulate_push() thing will work without further changes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/stackframe/32: Provide consistent pt_regsPeter Zijlstra
commit 3c88c692c28746473791276f8b42d2c989d6cbe6 upstream. Currently pt_regs on x86_32 has an oddity in that kernel regs (!user_mode(regs)) are short two entries (esp/ss). This means that any code trying to use them (typically: regs->sp) needs to jump through some unfortunate hoops. Change the entry code to fix this up and create a full pt_regs frame. This then simplifies various trampolines in ftrace and kprobes, the stack unwinder, ptrace, kdump and kgdb. Much thanks to Josh for help with the cleanups! Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/stackframe: Move ENCODE_FRAME_POINTER to asm/frame.hPeter Zijlstra
commit a9b3c6998d4a7d53a787cf4d0fd4a4c11239e517 upstream. In preparation for wider use, move the ENCODE_FRAME_POINTER macros to a common header and provide inline asm versions. These macros are used to encode a pt_regs frame for the unwinder; see unwind_frame.c:decode_frame_pointer(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/asm: Pin sensitive CR0 bitsKees Cook
commit 8dbec27a242cd3e2816eeb98d3237b9f57cf6232 upstream. With sensitive CR4 bits pinned now, it's possible that the WP bit for CR0 might become a target as well. Following the same reasoning for the CR4 pinning, pin CR0's WP bit. Contrary to the cpu feature dependend CR4 pinning this can be done with a constant value. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: kernel-hardening@lists.openwall.com Link: https://lkml.kernel.org/r/20190618045503.39105-4-keescook@chromium.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-01-09x86/asm: Pin sensitive CR4 bitsKees Cook
commit 873d50d58f67ef15d2777b5e7f7a5268bb1fbae2 upstream. Several recent exploits have used direct calls to the native_write_cr4() function to disable SMEP and SMAP before then continuing their exploits using userspace memory access. Direct calls of this form can be mitigate by pinning bits of CR4 so that they cannot be changed through a common function. This is not intended to be a general ROP protection (which would require CFI to defend against properly), but rather a way to avoid trivial direct function calling (or CFI bypasses via a matching function prototype) as seen in: https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html (https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308) The goals of this change: - Pin specific bits (SMEP, SMAP, and UMIP) when writing CR4. - Avoid setting the bits too early (they must become pinned only after CPU feature detection and selection has finished). - Pinning mask needs to be read-only during normal runtime. - Pinning needs to be checked after write to validate the cr4 state Using __ro_after_init on the mask is done so it can't be first disabled with a malicious write. Since these bits are global state (once established by the boot CPU and kernel boot parameters), they are safe to write to secondary CPUs before those CPUs have finished feature detection. As such, the bits are set at the first cr4 write, so that cr4 write bugs can be detected (instead of silently papered over). This uses a few bytes less storage of a location we don't have: read-only per-CPU data. A check is performed after the register write because an attack could just skip directly to the register write. Such a direct jump is possible because of how this function may be built by the compiler (especially due to the removal of frame pointers) where it doesn't add a stack frame (function exit may only be a retq without pops) which is sufficient for trivial exploitation like in the timer overwrites mentioned above). The asm argument constraints gain the "+" modifier to convince the compiler that it shouldn't make ordering assumptions about the arguments or memory, and treat them as changed. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: kernel-hardening@lists.openwall.com Link: https://lkml.kernel.org/r/20190618045503.39105-3-keescook@chromium.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10x86/fpu: Don't cache access to fpu_fpregs_owner_ctxSebastian Andrzej Siewior
commit 59c4bd853abcea95eccc167a7d7fd5f1a5f47b98 upstream. The state/owner of the FPU is saved to fpu_fpregs_owner_ctx by pointing to the context that is currently loaded. It never changed during the lifetime of a task - it remained stable/constant. After deferred FPU registers loading until return to userland was implemented, the content of fpu_fpregs_owner_ctx may change during preemption and must not be cached. This went unnoticed for some time and was now noticed, in particular since gcc 9 is caching that load in copy_fpstate_to_sigframe() and reusing it in the retry loop: copy_fpstate_to_sigframe() load fpu_fpregs_owner_ctx and save on stack fpregs_lock() copy_fpregs_to_sigframe() /* failed */ fpregs_unlock() *** PREEMPTION, another uses FPU, changes fpu_fpregs_owner_ctx *** fault_in_pages_writeable() /* succeed, retry */ fpregs_lock() __fpregs_load_activate() fpregs_state_valid() /* uses fpu_fpregs_owner_ctx from stack */ copy_fpregs_to_sigframe() /* succeeds, random FPU content */ This is a comparison of the assembly produced by gcc 9, without vs with this patch: | # arch/x86/kernel/fpu/signal.c:173: if (!access_ok(buf, size)) | cmpq %rdx, %rax # tmp183, _4 | jb .L190 #, |-# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |-#APP |-# 512 "arch/x86/include/asm/fpu/internal.h" 1 |- movq %gs:fpu_fpregs_owner_ctx,%rax #, pfo_ret__ |-# 0 "" 2 |-#NO_APP |- movq %rax, -88(%rbp) # pfo_ret__, %sfp … |-# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |- movq -88(%rbp), %rcx # %sfp, pfo_ret__ |- cmpq %rcx, -64(%rbp) # pfo_ret__, %sfp |+# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |+#APP |+# 512 "arch/x86/include/asm/fpu/internal.h" 1 |+ movq %gs:fpu_fpregs_owner_ctx(%rip),%rax # fpu_fpregs_owner_ctx, pfo_ret__ |+# 0 "" 2 |+# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |+#NO_APP |+ cmpq %rax, -64(%rbp) # pfo_ret__, %sfp Use this_cpu_read() instead this_cpu_read_stable() to avoid caching of fpu_fpregs_owner_ctx during preemption points. The Fixes: tag points to the commit where deferred FPU loading was added. Since this commit, the compiler is no longer allowed to move the load of fpu_fpregs_owner_ctx somewhere else / outside of the locked section. A task preemption will change its value and stale content will be observed. [ bp: Massage. ] Debugged-by: Austin Clements <austin@google.com> Debugged-by: David Chase <drchase@golang.org> Debugged-by: Ian Lance Taylor <ian@airs.com> Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Borislav Petkov <bp@suse.de> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Austin Clements <austin@google.com> Cc: Barret Rhoden <brho@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Chase <drchase@golang.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: ian@airs.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Bleecher Snyder <josharian@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191128085306.hxfa2o3knqtu4wfn@linutronix.de Link: https://bugzilla.kernel.org/show_bug.cgi?id=205663 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-12-10x86/cpu: Add Comet Lake to the Intel CPU models headerKan Liang
commit 8d7c6ac3b2371eb1cbc9925a88f4d10efff374de upstream. Comet Lake is the new 10th Gen Intel processor. Add two new CPU model numbers to the Intel family list. The CPU model numbers are not published in the SDM yet but they come from an authoritative internal source. [ bp: Touch up commit message. ] Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: ak@linux.intel.com Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/1570549810-25049-2-git-send-email-kan.liang@linux.intel.com Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-25uaccess: implement a proper unsafe_copy_to_user() and switch filldir over to itLinus Torvalds
commit c512c69187197fe08026cb5bbe7b9709f4f89b73 upstream. In commit 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") I made filldir() use unsafe_put_user(), which improves code generation on x86 enormously. But because we didn't have a "unsafe_copy_to_user()", the dirent name copy was also done by hand with unsafe_put_user() in a loop, and it turns out that a lot of other architectures didn't like that, because unlike x86, they have various alignment issues. Most non-x86 architectures trap and fix it up, and some (like xtensa) will just fail unaligned put_user() accesses unconditionally. Which makes that "copy using put_user() in a loop" not work for them at all. I could make that code do explicit alignment etc, but the architectures that don't like unaligned accesses also don't really use the fancy "user_access_begin/end()" model, so they might just use the regular old __copy_to_user() interface. So this commit takes that looping implementation, turns it into the x86 version of "unsafe_copy_to_user()", and makes other architectures implement the unsafe copy version as __copy_to_user() (the same way they do for the other unsafe_xyz() accessor functions). Note that it only does this for the copying _to_ user space, and we still don't have a unsafe version of copy_from_user(). That's partly because we have no current users of it, but also partly because the copy_from_user() case is slightly different and cannot efficiently be implemented in terms of a unsafe_get_user() loop (because gcc can't do asm goto with outputs). It would be trivial to do this using "rep movsb", which would work really nicely on newer x86 cores, but really badly on some older ones. Al Viro is looking at cleaning up all our user copy routines to make this all a non-issue, but for now we have this simple-but-stupid version for x86 that works fine for the dirent name copy case because those names are short strings and we simply don't need anything fancier. Fixes: 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-and-tested-by: Tony Luck <tony.luck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-14kvm: x86: mmu: Recovery of shattered NX large pagesJunaid Shahid
commit 1aa9b9572b10529c2e64e2b8f44025d86e124308 upstream. The page table pages corresponding to broken down large pages are zapped in FIFO order, so that the large page can potentially be recovered, if it is not longer being used for execution. This removes the performance penalty for walking deeper EPT page tables. By default, one large page will last about one hour once the guest reaches a steady state. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-14kvm: mmu: ITLB_MULTIHIT mitigationPaolo Bonzini
commit b8e8c8303ff28c61046a4d0f6ea99aea609a7dc0 upstream. With some Intel processors, putting the same virtual address in the TLB as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit and cause the processor to issue a machine check resulting in a CPU lockup. Unfortunately when EPT page tables use huge pages, it is possible for a malicious guest to cause this situation. Add a knob to mark huge pages as non-executable. When the nx_huge_pages parameter is enabled (and we are using EPT), all huge pages are marked as NX. If the guest attempts to execute in one of those pages, the page is broken down into 4K pages, which are then marked executable. This is not an issue for shadow paging (except nested EPT), because then the host is in control of TLB flushes and the problematic situation cannot happen. With nested EPT, again the nested guest can cause problems shadow and direct EPT is treated in the same way. [ tglx: Fixup default to auto and massage wording a bit ] Originally-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-14x86/bugs: Add ITLB_MULTIHIT bug infrastructureVineela Tummalapalli
commit db4d30fbb71b47e4ecb11c4efa5d8aad4b03dfae upstream. Some processors may incur a machine check error possibly resulting in an unrecoverable CPU lockup when an instruction fetch encounters a TLB multi-hit in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. The relevant erratum can be found here: https://bugzilla.kernel.org/show_bug.cgi?id=205195 There are other processors affected for which the erratum does not fully disclose the impact. This issue affects both bare-metal x86 page tables and EPT. It can be mitigated by either eliminating the use of large pages or by using careful TLB invalidations when changing the page size in the page tables. Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which are mitigated against this issue. Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [PG: drop AIRMONT_NP - not present in v5.2.x codebase.] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-14x86/speculation/taa: Add mitigation for TSX Async AbortPawan Gupta
commit 1b42f017415b46c317e71d41c34ec088417a1883 upstream. TSX Async Abort (TAA) is a side channel vulnerability to the internal buffers in some Intel processors similar to Microachitectural Data Sampling (MDS). In this case, certain loads may speculatively pass invalid data to dependent operations when an asynchronous abort condition is pending in a TSX transaction. This includes loads with no fault or assist condition. Such loads may speculatively expose stale data from the uarch data structures as in MDS. Scope of exposure is within the same-thread and cross-thread. This issue affects all current processors that support TSX, but do not have ARCH_CAP_TAA_NO (bit 8) set in MSR_IA32_ARCH_CAPABILITIES. On CPUs which have their IA32_ARCH_CAPABILITIES MSR bit MDS_NO=0, CPUID.MD_CLEAR=1 and the MDS mitigation is clearing the CPU buffers using VERW or L1D_FLUSH, there is no additional mitigation needed for TAA. On affected CPUs with MDS_NO=1 this issue can be mitigated by disabling the Transactional Synchronization Extensions (TSX) feature. A new MSR IA32_TSX_CTRL in future and current processors after a microcode update can be used to control the TSX feature. There are two bits in that MSR: * TSX_CTRL_RTM_DISABLE disables the TSX sub-feature Restricted Transactional Memory (RTM). * TSX_CTRL_CPUID_CLEAR clears the RTM enumeration in CPUID. The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled with updated microcode but still enumerated as present by CPUID(EAX=7).EBX{bit4}. The second mitigation approach is similar to MDS which is clearing the affected CPU buffers on return to user space and when entering a guest. Relevant microcode update is required for the mitigation to work. More details on this approach can be found here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html The TSX feature can be controlled by the "tsx" command line parameter. If it is force-enabled then "Clear CPU buffers" (MDS mitigation) is deployed. The effective mitigation state can be read from sysfs. [ bp: - massage + comments cleanup - s/TAA_MITIGATION_TSX_DISABLE/TAA_MITIGATION_TSX_DISABLED/g - Josh. - remove partial TAA mitigation in update_mds_branch_idle() - Josh. - s/tsx_async_abort_cmdline/tsx_async_abort_parse_cmdline/g ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-14x86/msr: Add the IA32_TSX_CTRL MSRPawan Gupta
commit c2955f270a84762343000f103e0640d29c7a96f3 upstream. Transactional Synchronization Extensions (TSX) may be used on certain processors as part of a speculative side channel attack. A microcode update for existing processors that are vulnerable to this attack will add a new MSR - IA32_TSX_CTRL to allow the system administrator the option to disable TSX as one of the possible mitigations. The CPUs which get this new MSR after a microcode upgrade are the ones which do not set MSR_IA32_ARCH_CAPABILITIES.MDS_NO (bit 5) because those CPUs have CPUID.MD_CLEAR, i.e., the VERW implementation which clears all CPU buffers takes care of the TAA case as well. [ Note that future processors that are not vulnerable will also support the IA32_TSX_CTRL MSR. ] Add defines for the new IA32_TSX_CTRL MSR and its bits. TSX has two sub-features: 1. Restricted Transactional Memory (RTM) is an explicitly-used feature where new instructions begin and end TSX transactions. 2. Hardware Lock Elision (HLE) is implicitly used when certain kinds of "old" style locks are used by software. Bit 7 of the IA32_ARCH_CAPABILITIES indicates the presence of the IA32_TSX_CTRL MSR. There are two control bits in IA32_TSX_CTRL MSR: Bit 0: When set, it disables the Restricted Transactional Memory (RTM) sub-feature of TSX (will force all transactions to abort on the XBEGIN instruction). Bit 1: When set, it disables the enumeration of the RTM and HLE feature (i.e. it will make CPUID(EAX=7).EBX{bit4} and CPUID(EAX=7).EBX{bit11} read as 0). The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled by the new microcode but still enumerated as present by CPUID(EAX=7).EBX{bit4}, unless disabled by IA32_TSX_CTRL_MSR[1] - TSX_CTRL_CPUID_CLEAR. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Mark Gross <mgross@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-14x86/cpu: Move arch_smt_update() to a neutral placeThomas Gleixner
commit 9c92374b631d233abf5bd355cb4253d3d83d5578 upstream. arch_smt_update() will be used to control IPI/NMI broadcasting via the shorthand mechanism. Keeping it in the bugs file and calling the apic function from there is possible, but not really intuitive. Move it to a neutral place and invoke the bugs function from there. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190722105219.910317273@linutronix.de Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-11-09x86/asm: Fix MWAITX C-state hint valueJanakarajan Natarajan
commit 454de1e7d970d6bc567686052329e4814842867c upstream. As per "AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions", MWAITX EAX[7:4]+1 specifies the optional hint of the optimized C-state. For C0 state, EAX[7:4] should be set to 0xf. Currently, a value of 0xf is set for EAX[3:0] instead of EAX[7:4]. Fix this by changing MWAITX_DISABLE_CSTATES from 0xf to 0xf0. This hasn't had any implications so far because setting reserved bits in EAX is simply ignored by the CPU. [ bp: Fixup comment in delay_mwaitx() and massage. ] Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191007190011.4859-1-Janakarajan.Natarajan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-10-05KVM: x86: Disable posted interrupts for non-standard IRQs delivery modesAlexander Graf
commit fdcf756213756c23b533ca4974d1f48c6a4d4281 upstream. We can easily route hardware interrupts directly into VM context when they target the "Fixed" or "LowPriority" delivery modes. However, on modes such as "SMI" or "Init", we need to go via KVM code to actually put the vCPU into a different mode of operation, so we can not post the interrupt Add code in the VMX and SVM PI logic to explicitly refuse to establish posted mappings for advanced IRQ deliver modes. This reflects the logic in __apic_accept_irq() which also only ever passes Fixed and LowPriority interrupts as posted interrupts into the guest. This fixes a bug I have with code which configures real hardware to inject virtual SMIs into my guest. Signed-off-by: Alexander Graf <graf@amazon.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05x86/cpu: Add Tiger Lake to Intel familyGayatri Kammela
[ Upstream commit 6e1c32c5dbb4b90eea8f964c2869d0bde050dbe0 ] Add the model numbers/CPUIDs of Tiger Lake mobile and desktop to the Intel family. Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rahul Tanwar <rahul.tanwar@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190905193020.14707-2-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21x86/uaccess: Don't leak the AC flags into __get_user() argument evaluationPeter Zijlstra
[ Upstream commit 9b8bd476e78e89c9ea26c3b435ad0201c3d7dbf5 ] Identical to __put_user(); the __get_user() argument evalution will too leak UBSAN crud into the __uaccess_begin() / __uaccess_end() region. While uncommon this was observed to happen for: drivers/xen/gntdev.c: if (__get_user(old_status, batch->status[i])) where UBSAN added array bound checking. This complements commit: 6ae865615fc4 ("x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation") Tested-by Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: broonie@kernel.org Cc: sfr@canb.auug.org.au Cc: akpm@linux-foundation.org Cc: Randy Dunlap <rdunlap@infradead.org> Cc: mhocko@suse.cz Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20190829082445.GM2369@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21perf/x86/amd/ibs: Fix sample bias for dispatched micro-opsKim Phillips
[ Upstream commit 0f4cd769c410e2285a4e9873a684d90423f03090 ] When counting dispatched micro-ops with cnt_ctl=1, in order to prevent sample bias, IBS hardware preloads the least significant 7 bits of current count (IbsOpCurCnt) with random values, such that, after the interrupt is handled and counting resumes, the next sample taken will be slightly perturbed. The current count bitfield is in the IBS execution control h/w register, alongside the maximum count field. Currently, the IBS driver writes that register with the maximum count, leaving zeroes to fill the current count field, thereby overwriting the random bits the hardware preloaded for itself. Fix the driver to actually retain and carry those random bits from the read of the IBS control register, through to its write, instead of overwriting the lower current count bits with zeroes. Tested with: perf record -c 100001 -e ibs_op/cnt_ctl=1/pp -a -C 0 taskset -c 0 <workload> 'perf annotate' output before: 15.70 65: addsd %xmm0,%xmm1 17.30 add $0x1,%rax 15.88 cmp %rdx,%rax je 82 17.32 72: test $0x1,%al jne 7c 7.52 movapd %xmm1,%xmm0 5.90 jmp 65 8.23 7c: sqrtsd %xmm1,%xmm0 12.15 jmp 65 'perf annotate' output after: 16.63 65: addsd %xmm0,%xmm1 16.82 add $0x1,%rax 16.81 cmp %rdx,%rax je 82 16.69 72: test $0x1,%al jne 7c 8.30 movapd %xmm1,%xmm0 8.13 jmp 65 8.24 7c: sqrtsd %xmm1,%xmm0 8.39 jmp 65 Tested on Family 15h and 17h machines. Machines prior to family 10h Rev. C don't have the RDWROPCNT capability, and have the IbsOpCurCnt bitfield reserved, so this patch shouldn't affect their operation. It is unknown why commit db98c5faf8cb ("perf/x86: Implement 64-bit counter support for IBS") ignored the lower 4 bits of the IbsOpCurCnt field; the number of preloaded random bits has always been 7, AFAICT. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Arnaldo Carvalho de Melo" <acme@kernel.org> Cc: <x86@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Borislav Petkov" <bp@alien8.de> Cc: Stephane Eranian <eranian@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: "Namhyung Kim" <namhyung@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20190826195730.30614-1-kim.phillips@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-19KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslotSean Christopherson
commit 002c5f73c508f7df5681bda339831c27f3c1aef4 upstream. James Harvey reported a livelock that was introduced by commit d012a06ab1d23 ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot""). The livelock occurs because kvm_mmu_zap_all() as it exists today will voluntarily reschedule and drop KVM's mmu_lock, which allows other vCPUs to add shadow pages. With enough vCPUs, kvm_mmu_zap_all() can get stuck in an infinite loop as it can never zap all pages before observing lock contention or the need to reschedule. The equivalent of kvm_mmu_zap_all() that was in use at the time of the reverted commit (4e103134b8623, "KVM: x86/mmu: Zap only the relevant pages when removing a memslot") employed a fast invalidate mechanism and was not susceptible to the above livelock. There are three ways to fix the livelock: - Reverting the revert (commit d012a06ab1d23) is not a viable option as the revert is needed to fix a regression that occurs when the guest has one or more assigned devices. It's unlikely we'll root cause the device assignment regression soon enough to fix the regression timely. - Remove the conditional reschedule from kvm_mmu_zap_all(). However, although removing the reschedule would be a smaller code change, it's less safe in the sense that the resulting kvm_mmu_zap_all() hasn't been used in the wild for flushing memslots since the fast invalidate mechanism was introduced by commit 6ca18b6950f8d ("KVM: x86: use the fast way to invalidate all pages"), back in 2013. - Reintroduce the fast invalidate mechanism and use it when zapping shadow pages in response to a memslot being deleted/moved, which is what this patch does. For all intents and purposes, this is a revert of commit ea145aacf4ae8 ("Revert "KVM: MMU: fast invalidate all pages"") and a partial revert of commit 7390de1e99a70 ("Revert "KVM: x86: use the fast way to invalidate all pages""), i.e. restores the behavior of commit 5304b8d37c2a5 ("KVM: MMU: fast invalidate all pages") and commit 6ca18b6950f8d ("KVM: x86: use the fast way to invalidate all pages") respectively. Fixes: d012a06ab1d23 ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"") Reported-by: James Harvey <jamespharvey20@gmail.com> Cc: Alex Willamson <alex.williamson@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-10x86/boot: Preserve boot_params.secure_boot from sanitizingJohn S. Gruber
commit 29d9a0b50736768f042752070e5cdf4e4d4c00df upstream. Commit a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") now zeroes the secure boot setting information (enabled/disabled/...) passed by the boot loader or by the kernel's EFI handover mechanism. The problem manifests itself with signed kernels using the EFI handoff protocol with grub and the kernel loses the information whether secure boot is enabled in the firmware, i.e., the log message "Secure boot enabled" becomes "Secure boot could not be determined". efi_main() arch/x86/boot/compressed/eboot.c sets this field early but it is subsequently zeroed by the above referenced commit. Include boot_params.secure_boot in the preserve field list. [ bp: restructure commit message and massage. ] Fixes: a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") Signed-off-by: John S. Gruber <JohnSGruber@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Mark Brown <broonie@kernel.org> Cc: stable <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/CAPotdmSPExAuQcy9iAHqX3js_fc4mMLQOTr5RBGvizyCOPcTQQ@mail.gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29x86/boot: Fix boot regression caused by bootparam sanitizingJohn Hubbard
commit 7846f58fba964af7cb8cf77d4d13c33254725211 upstream. commit a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") had two errors: * It preserved boot_params.acpi_rsdp_addr, and * It failed to preserve boot_params.hdr Therefore, zero out acpi_rsdp_addr, and preserve hdr. Fixes: a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") Reported-by: Neil MacLeod <neil@nmacleod.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neil MacLeod <neil@nmacleod.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190821192513.20126-1-jhubbard@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29x86/boot: Save fields explicitly, zero out everything elseJohn Hubbard
commit a90118c445cc7f07781de26a9684d4ec58bfcfd1 upstream. Recent gcc compilers (gcc 9.1) generate warnings about an out of bounds memset, if the memset goes accross several fields of a struct. This generated a couple of warnings on x86_64 builds in sanitize_boot_params(). Fix this by explicitly saving the fields in struct boot_params that are intended to be preserved, and zeroing all the rest. [ tglx: Tagged for stable as it breaks the warning free build there as well ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190731054627.5627-2-jhubbard@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16hTom Lendacky
commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24 upstream. There have been reports of RDRAND issues after resuming from suspend on some AMD family 15h and family 16h systems. This issue stems from a BIOS not performing the proper steps during resume to ensure RDRAND continues to function properly. RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND support using CPUID, including the kernel, will believe that RDRAND is not supported. Update the CPU initialization to clear the RDRAND CPUID bit for any family 15h and 16h processor that supports RDRAND. If it is known that the family 15h or family 16h system does not have an RDRAND resume issue or that the system will not be placed in suspend, the "rdrand=force" kernel parameter can be used to stop the clearing of the RDRAND CPUID bit. Additionally, update the suspend and resume path to save and restore the MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in place after resuming from suspend. Note, that clearing the RDRAND CPUID bit does not prevent a processor that normally supports the RDRAND instruction from executing it. So any code that determined the support based on family and model won't #UD. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chen Yu <yu.c.chen@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org> Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-29x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386Sean Christopherson
commit b63f20a778c88b6a04458ed6ffc69da953d3a109 upstream. Use 'lea' instead of 'add' when adjusting %rsp in CALL_NOSPEC so as to avoid clobbering flags. KVM's emulator makes indirect calls into a jump table of sorts, where the destination of the CALL_NOSPEC is a small blob of code that performs fast emulation by executing the target instruction with fixed operands. adcb_al_dl: 0x000339f8 <+0>: adc %dl,%al 0x000339fa <+2>: ret A major motiviation for doing fast emulation is to leverage the CPU to handle consumption and manipulation of arithmetic flags, i.e. RFLAGS is both an input and output to the target of CALL_NOSPEC. Clobbering flags results in all sorts of incorrect emulation, e.g. Jcc instructions often take the wrong path. Sans the nops... asm("push %[flags]; popf; " CALL_NOSPEC " ; pushf; pop %[flags]\n" 0x0003595a <+58>: mov 0xc0(%ebx),%eax 0x00035960 <+64>: mov 0x60(%ebx),%edx 0x00035963 <+67>: mov 0x90(%ebx),%ecx 0x00035969 <+73>: push %edi 0x0003596a <+74>: popf 0x0003596b <+75>: call *%esi 0x000359a0 <+128>: pushf 0x000359a1 <+129>: pop %edi 0x000359a2 <+130>: mov %eax,0xc0(%ebx) 0x000359b1 <+145>: mov %edx,0x60(%ebx) ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK); 0x000359a8 <+136>: mov -0x10(%ebp),%eax 0x000359ab <+139>: and $0x8d5,%edi 0x000359b4 <+148>: and $0xfffff72a,%eax 0x000359b9 <+153>: or %eax,%edi 0x000359bd <+157>: mov %edi,0x4(%ebx) For the most part this has gone unnoticed as emulation of guest code that can trigger fast emulation is effectively limited to MMIO when running on modern hardware, and MMIO is rarely, if ever, accessed by instructions that affect or consume flags. Breakage is almost instantaneous when running with unrestricted guest disabled, in which case KVM must emulate all instructions when the guest has invalid state, e.g. when the guest is in Big Real Mode during early BIOS. Fixes: 776b043848fd2 ("x86/retpoline: Add initial retpoline support") Fixes: 1a29b5b7f347a ("KVM: x86: Make indirect calls in emulator speculation safe") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190822211122.27579-1-sean.j.christopherson@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16KVM: Fix leak vCPU's VMCS value into other pCPUWanpeng Li
commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream. After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting in the VMs after stress testing: INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073) Call Trace: flush_tlb_mm_range+0x68/0x140 tlb_flush_mmu.part.75+0x37/0xe0 tlb_finish_mmu+0x55/0x60 zap_page_range+0x142/0x190 SyS_madvise+0x3cd/0x9c0 system_call_fastpath+0x1c/0x21 swait_active() sustains to be true before finish_swait() is called in kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin() loop greatly increases the probability condition kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv is enabled the yield-candidate vCPU's VMCS RVI field leaks(by vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current VMCS. This patch fixes it by checking conservatively a subset of events. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGSThomas Gleixner
commit f36cf386e3fec258a341d446915862eded3e13d8 upstream Intel provided the following information: On all current Atom processors, instructions that use a segment register value (e.g. a load or store) will not speculatively execute before the last writer of that segment retires. Thus they will not use a speculatively written segment value. That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS entry paths can be excluded from the extra LFENCE if PTI is disabled. Create a separate bug flag for the through SWAPGS speculation and mark all out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order ATOMs are excluded from the whole mitigation mess anyway. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06x86/speculation: Prepare entry code for Spectre v1 swapgs mitigationsJosh Poimboeuf
commit 18ec54fdd6d18d92025af097cd042a75cf0ea24c upstream Spectre v1 isn't only about array bounds checks. It can affect any conditional checks. The kernel entry code interrupt, exception, and NMI handlers all have conditional swapgs checks. Those may be problematic in the context of Spectre v1, as kernel code can speculatively run with a user GS. For example: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg mov (%reg), %reg1 When coming from user space, the CPU can speculatively skip the swapgs, and then do a speculative percpu load using the user GS value. So the user can speculatively force a read of any kernel value. If a gadget exists which uses the percpu value as an address in another load/store, then the contents of the kernel value may become visible via an L1 side channel attack. A similar attack exists when coming from kernel space. The CPU can speculatively do the swapgs, causing the user GS to get used for the rest of the speculative window. The mitigation is similar to a traditional Spectre v1 mitigation, except: a) index masking isn't possible; because the index (percpu offset) isn't user-controlled; and b) an lfence is needed in both the "from user" swapgs path and the "from kernel" non-swapgs path (because of the two attacks described above). The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a CR3 write when PTI is enabled. Since CR3 writes are serializing, the lfences can be skipped in those cases. On the other hand, the kernel entry swapgs paths don't depend on PTI. To avoid unnecessary lfences for the user entry case, create two separate features for alternative patching: X86_FEATURE_FENCE_SWAPGS_USER X86_FEATURE_FENCE_SWAPGS_KERNEL Use these features in entry code to patch in lfences where needed. The features aren't enabled yet, so there's no functional change. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06x86/cpufeatures: Combine word 11 and 12 into a new scattered features wordFenghua Yu
commit acec0ce081de0c36459eea91647faf99296445a3 upstream It's a waste for the four X86_FEATURE_CQM_* feature bits to occupy two whole feature bits words. To better utilize feature words, re-define word 11 to host scattered features and move the four X86_FEATURE_CQM_* features into Linux defined word 11. More scattered features can be added in word 11 in the future. Rename leaf 11 in cpuid_leafs to CPUID_LNX_4 to reflect it's a Linux-defined leaf. Rename leaf 12 as CPUID_DUMMY which will be replaced by a meaningful name in the next patch when CPUID.7.1:EAX occupies world 12. Maximum number of RMID and cache occupancy scale are retrieved from CPUID.0xf.1 after scattered CQM features are enumerated. Carve out the code into a separate function. KVM doesn't support resctrl now. So it's safe to move the X86_FEATURE_CQM_* features to scattered features word 11 for KVM. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Aaron Lewis <aaronlewis@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Babu Moger <babu.moger@amd.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: "Sean J Christopherson" <sean.j.christopherson@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Ravi V Shankar <ravi.v.shankar@intel.com> Cc: Sherry Hurwitz <sherry.hurwitz@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: x86 <x86@kernel.org> Link: https://lkml.kernel.org/r/1560794416-217638-2-git-send-email-fenghua.yu@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-06x86/paravirt: Fix callee-saved function ELF sizesJosh Poimboeuf
[ Upstream commit 083db6764821996526970e42d09c1ab2f4155dd4 ] The __raw_callee_save_*() functions have an ELF symbol size of zero, which confuses objtool and other tools. Fixes a bunch of warnings like the following: arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_pte_val() is missing an ELF size annotation arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_pgd_val() is missing an ELF size annotation arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_make_pte() is missing an ELF size annotation arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_make_pgd() is missing an ELF size annotation Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/afa6d49bb07497ca62e4fc3b27a2d0cece545b4e.1563413318.git.jpoimboe@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06x86/kvm: Don't call kvm_spurious_fault() from .fixupJosh Poimboeuf
[ Upstream commit 3901336ed9887b075531bffaeef7742ba614058b ] After making a change to improve objtool's sibling call detection, it started showing the following warning: arch/x86/kvm/vmx/nested.o: warning: objtool: .fixup+0x15: sibling call from callable instruction with modified stack frame The problem is the ____kvm_handle_fault_on_reboot() macro. It does a fake call by pushing a fake RIP and doing a jump. That tricks the unwinder into printing the function which triggered the exception, rather than the .fixup code. Instead of the hack to make it look like the original function made the call, just change the macro so that the original function actually does make the call. This allows removal of the hack, and also makes objtool happy. I triggered a vmx instruction exception and verified that the stack trace is still sane: kernel BUG at arch/x86/kvm/x86.c:358! invalid opcode: 0000 [#1] SMP PTI CPU: 28 PID: 4096 Comm: qemu-kvm Not tainted 5.2.0+ #16 Hardware name: Lenovo THINKSYSTEM SD530 -[7X2106Z000]-/-[7X2106Z000]-, BIOS -[TEE113Z-1.00]- 07/17/2017 RIP: 0010:kvm_spurious_fault+0x5/0x10 Code: 00 00 00 00 00 8b 44 24 10 89 d2 45 89 c9 48 89 44 24 10 8b 44 24 08 48 89 44 24 08 e9 d4 40 22 00 0f 1f 40 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41 RSP: 0018:ffffbf91c683bd00 EFLAGS: 00010246 RAX: 000061f040000000 RBX: ffff9e159c77bba0 RCX: ffff9e15a5c87000 RDX: 0000000665c87000 RSI: ffff9e15a5c87000 RDI: ffff9e159c77bba0 RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9e15a5c87000 R10: 0000000000000000 R11: fffff8f2d99721c0 R12: ffff9e159c77bba0 R13: ffffbf91c671d960 R14: ffff9e159c778000 R15: 0000000000000000 FS: 00007fa341cbe700(0000) GS:ffff9e15b7400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fdd38356804 CR3: 00000006759de003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: loaded_vmcs_init+0x4f/0xe0 alloc_loaded_vmcs+0x38/0xd0 vmx_create_vcpu+0xf7/0x600 kvm_vm_ioctl+0x5e9/0x980 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 ? free_one_page+0x13f/0x4e0 do_vfs_ioctl+0xa4/0x630 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x55/0x1c0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fa349b1ee5b Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/64a9b64d127e87b6920a97afde8e96ea76f6524e.1563413318.git.jpoimboe@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06xen/pv: Fix a boot up hang revealed by int3 self testZhenzhong Duan
[ Upstream commit b23e5844dfe78a80ba672793187d3f52e4b528d7 ] Commit 7457c0da024b ("x86/alternatives: Add int3_emulate_call() selftest") is used to ensure there is a gap setup in int3 exception stack which could be used for inserting call return address. This gap is missed in XEN PV int3 exception entry path, then below panic triggered: [ 0.772876] general protection fault: 0000 [#1] SMP NOPTI [ 0.772886] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0+ #11 [ 0.772893] RIP: e030:int3_magic+0x0/0x7 [ 0.772905] RSP: 3507:ffffffff82203e98 EFLAGS: 00000246 [ 0.773334] Call Trace: [ 0.773334] alternative_instructions+0x3d/0x12e [ 0.773334] check_bugs+0x7c9/0x887 [ 0.773334] ? __get_locked_pte+0x178/0x1f0 [ 0.773334] start_kernel+0x4ff/0x535 [ 0.773334] ? set_init_arg+0x55/0x55 [ 0.773334] xen_start_kernel+0x571/0x57a For 64bit PV guests, Xen's ABI enters the kernel with using SYSRET, with %rcx/%r11 on the stack. To convert back to "normal" looking exceptions, the xen thunks do 'xen_*: pop %rcx; pop %r11; jmp *'. E.g. Extracting 'xen_pv_trap xenint3' we have: xen_xenint3: pop %rcx; pop %r11; jmp xenint3 As xenint3 and int3 entry code are same except xenint3 doesn't generate a gap, we can fix it by using int3 and drop useless xenint3. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-06x86/apic: Silence -Wtype-limits compiler warningsQian Cai
[ Upstream commit ec6335586953b0df32f83ef696002063090c7aef ] There are many compiler warnings like this, In file included from ./arch/x86/include/asm/smp.h:13, from ./arch/x86/include/asm/mmzone_64.h:11, from ./arch/x86/include/asm/mmzone.h:5, from ./include/linux/mmzone.h:969, from ./include/linux/gfp.h:6, from ./include/linux/mm.h:10, from arch/x86/kernel/apic/io_apic.c:34: arch/x86/kernel/apic/io_apic.c: In function 'check_timer': ./arch/x86/include/asm/apic.h:37:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits] if ((v) <= apic_verbosity) \ ^~ arch/x86/kernel/apic/io_apic.c:2160:2: note: in expansion of macro 'apic_printk' apic_printk(APIC_QUIET, KERN_INFO "..TIMER: vector=0x%02X " ^~~~~~~~~~~ ./arch/x86/include/asm/apic.h:37:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits] if ((v) <= apic_verbosity) \ ^~ arch/x86/kernel/apic/io_apic.c:2207:4: note: in expansion of macro 'apic_printk' apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: " ^~~~~~~~~~~ APIC_QUIET is 0, so silence them by making apic_verbosity type int. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1562621805-24789-1-git-send-email-cai@lca.pw Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-28Revert "kvm: x86: Use task structs fpu field for user"Paolo Bonzini
commit ec269475cba7bcdd1eb8fdf8e87f4c6c81a376fe upstream. This reverts commit 240c35a3783ab9b3a0afaba0dde7291295680a6b ("kvm: x86: Use task structs fpu field for user", 2018-11-06). The commit is broken and causes QEMU's FPU state to be destroyed when KVM_RUN is preempted. Fixes: 240c35a3783a ("kvm: x86: Use task structs fpu field for user") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26x86/atomic: Fix smp_mb__{before,after}_atomic()Peter Zijlstra
[ Upstream commit 69d927bba39517d0980462efc051875b7f4db185 ] Recent probing at the Linux Kernel Memory Model uncovered a 'surprise'. Strongly ordered architectures where the atomic RmW primitive implies full memory ordering and smp_mb__{before,after}_atomic() are a simple barrier() (such as x86) fail for: *x = 1; atomic_inc(u); smp_mb__after_atomic(); r0 = *y; Because, while the atomic_inc() implies memory order, it (surprisingly) does not provide a compiler barrier. This then allows the compiler to re-order like so: atomic_inc(u); *x = 1; smp_mb__after_atomic(); r0 = *y; Which the CPU is then allowed to re-order (under TSO rules) like: atomic_inc(u); r0 = *y; *x = 1; And this very much was not intended. Therefore strengthen the atomic RmW ops to include a compiler barrier. NOTE: atomic_{or,and,xor} and the bitops already had the compiler barrier. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-26x86/cpufeatures: Add FDP_EXCPTN_ONLY and ZERO_FCS_FDSAaron Lewis
[ Upstream commit cbb99c0f588737ec98c333558922ce47e9a95827 ] Add the CPUID enumeration for Intel's de-feature bits to accommodate passing these de-features through to kvm guests. These de-features are (from SDM vol 1, section 8.1.8): - X86_FEATURE_FDP_EXCPTN_ONLY: If CPUID.(EAX=07H,ECX=0H):EBX[bit 6] = 1, the data pointer (FDP) is updated only for the x87 non-control instructions that incur unmasked x87 exceptions. - X86_FEATURE_ZERO_FCS_FDS: If CPUID.(EAX=07H,ECX=0H):EBX[bit 13] = 1, the processor deprecates FCS and FDS; it saves each as 0000H. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jim Mattson <jmattson@google.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: marcorr@google.com Cc: Peter Feiner <pfeiner@google.com> Cc: pshier@google.com Cc: Robert Hoo <robert.hu@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190605220252.103406-1-aaronlewis@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-26x86/cpu: Add Ice Lake NNPI to Intel familyRajneesh Bhardwaj
[ Upstream commit e32d045cd4ba06b59878323e434bad010e78e658 ] Add the CPUID model number of Ice Lake Neural Network Processor for Deep Learning Inference (ICL-NNPI) to the Intel family list. Ice Lake NNPI uses model number 0x9D and this will be documented in a future version of Intel Software Development Manual. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: platform-driver-x86@vger.kernel.org Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Linux PM <linux-pm@vger.kernel.org> Link: https://lkml.kernel.org/r/20190606012419.13250-1-rajneesh.bhardwaj@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-21x86/irq: Seperate unused system vectors from spurious entry againThomas Gleixner
commit f8a8fe61fec8006575699559ead88b0b833d5cad upstream. Quite some time ago the interrupt entry stubs for unused vectors in the system vector range got removed and directly mapped to the spurious interrupt vector entry point. Sounds reasonable, but it's subtly broken. The spurious interrupt vector entry point pushes vector number 0xFF on the stack which makes the whole logic in __smp_spurious_interrupt() pointless. As a consequence any spurious interrupt which comes from a vector != 0xFF is treated as a real spurious interrupt (vector 0xFF) and not acknowledged. That subsequently stalls all interrupt vectors of equal and lower priority, which brings the system to a grinding halt. This can happen because even on 64-bit the system vector space is not guaranteed to be fully populated. A full compile time handling of the unused vectors is not possible because quite some of them are conditonally populated at runtime. Bring the entry stubs back, which wastes 160 bytes if all stubs are unused, but gains the proper handling back. There is no point to selectively spare some of the stubs which are known at compile time as the required code in the IDT management would be way larger and convoluted. Do not route the spurious entries through common_interrupt and do_IRQ() as the original code did. Route it to smp_spurious_interrupt() which evaluates the vector number and acts accordingly now that the real vector numbers are handed in. Fixup the pr_warn so the actual spurious vector (0xff) is clearly distiguished from the other vectors and also note for the vectored case whether it was pending in the ISR or not. "Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen." "Spurious interrupt vector 0xed on CPU#1. Acked." "Spurious interrupt vector 0xee on CPU#1. Not pending!." Fixes: 2414e021ac8d ("x86: Avoid building unused IRQ entry stubs") Reported-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Jan Beulich <jbeulich@suse.com> Link: https://lkml.kernel.org/r/20190628111440.550568228@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-21x86/irq: Handle spurious interrupt after shutdown gracefullyThomas Gleixner
commit b7107a67f0d125459fe41f86e8079afd1a5e0b15 upstream. Since the rework of the vector management, warnings about spurious interrupts have been reported. Robert provided some more information and did an initial analysis. The following situation leads to these warnings: CPU 0 CPU 1 IO_APIC interrupt is raised sent to CPU1 Unable to handle immediately (interrupts off, deep idle delay) mask() ... free() shutdown() synchronize_irq() clear_vector() do_IRQ() -> vector is clear Before the rework the vector entries of legacy interrupts were statically assigned and occupied precious vector space while most of them were unused. Due to that the above situation was handled silently because the vector was handled and the core handler of the assigned interrupt descriptor noticed that it is shut down and returned. While this has been usually observed with legacy interrupts, this situation is not limited to them. Any other interrupt source, e.g. MSI, can cause the same issue. After adding proper synchronization for level triggered interrupts, this can only happen for edge triggered interrupts where the IO-APIC obviously cannot provide information about interrupts in flight. While the spurious warning is actually harmless in this case it worries users and driver developers. Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead of VECTOR_UNUSED when the vector is freed up. If that above late handling happens the spurious detector will not complain and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on that line will trigger the spurious warning as before. Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>- Tested-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20190628111440.459647741@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500Thomas Gleixner
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>