diff options
Diffstat (limited to 'common/recipes-kernel/linux/linux-yocto-4.9.21/0047-x86-Documentation-Add-PTI-description.patch')
-rw-r--r-- | common/recipes-kernel/linux/linux-yocto-4.9.21/0047-x86-Documentation-Add-PTI-description.patch | 267 |
1 files changed, 0 insertions, 267 deletions
diff --git a/common/recipes-kernel/linux/linux-yocto-4.9.21/0047-x86-Documentation-Add-PTI-description.patch b/common/recipes-kernel/linux/linux-yocto-4.9.21/0047-x86-Documentation-Add-PTI-description.patch deleted file mode 100644 index bd399062..00000000 --- a/common/recipes-kernel/linux/linux-yocto-4.9.21/0047-x86-Documentation-Add-PTI-description.patch +++ /dev/null @@ -1,267 +0,0 @@ -From 3a2bc0721f7a7cb408570b01508a581ef69a2aac Mon Sep 17 00:00:00 2001 -From: Dave Hansen <dave.hansen@linux.intel.com> -Date: Fri, 5 Jan 2018 09:44:36 -0800 -Subject: [PATCH 047/103] x86/Documentation: Add PTI description - -commit 01c9b17bf673b05bb401b76ec763e9730ccf1376 upstream. - -Add some details about how PTI works, what some of the downsides -are, and how to debug it when things go wrong. - -Also document the kernel parameter: 'pti/nopti'. - -Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> -Signed-off-by: Thomas Gleixner <tglx@linutronix.de> -Reviewed-by: Randy Dunlap <rdunlap@infradead.org> -Reviewed-by: Kees Cook <keescook@chromium.org> -Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at> -Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at> -Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at> -Cc: Richard Fellner <richard.fellner@student.tugraz.at> -Cc: Andy Lutomirski <luto@kernel.org> -Cc: Linus Torvalds <torvalds@linux-foundation.org> -Cc: Hugh Dickins <hughd@google.com> -Cc: Andi Lutomirsky <luto@kernel.org> -Cc: stable@vger.kernel.org -Link: https://lkml.kernel.org/r/20180105174436.1BC6FA2B@viggo.jf.intel.com -Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> ---- - Documentation/kernel-parameters.txt | 21 ++-- - Documentation/x86/pti.txt | 186 ++++++++++++++++++++++++++++++++++++ - 2 files changed, 200 insertions(+), 7 deletions(-) - create mode 100644 Documentation/x86/pti.txt - -diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt -index 9f04c53..3d53778 100644 ---- a/Documentation/kernel-parameters.txt -+++ b/Documentation/kernel-parameters.txt -@@ -2753,8 +2753,6 @@ bytes respectively. Such letter suffixes can also be entirely omitted. - - nojitter [IA-64] Disables jitter checking for ITC timers. - -- nopti [X86-64] Disable KAISER isolation of kernel from user. -- - no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver - - no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page -@@ -3317,11 +3315,20 @@ bytes respectively. Such letter suffixes can also be entirely omitted. - pt. [PARIDE] - See Documentation/blockdev/paride.txt. - -- pti= [X86_64] -- Control KAISER user/kernel address space isolation: -- on - enable -- off - disable -- auto - default setting -+ pti= [X86_64] Control Page Table Isolation of user and -+ kernel address spaces. Disabling this feature -+ removes hardening, but improves performance of -+ system calls and interrupts. -+ -+ on - unconditionally enable -+ off - unconditionally disable -+ auto - kernel detects whether your CPU model is -+ vulnerable to issues that PTI mitigates -+ -+ Not specifying this option is equivalent to pti=auto. -+ -+ nopti [X86_64] -+ Equivalent to pti=off - - pty.legacy_count= - [KNL] Number of legacy pty's. Overwrites compiled-in -diff --git a/Documentation/x86/pti.txt b/Documentation/x86/pti.txt -new file mode 100644 -index 0000000..d11eff6 ---- /dev/null -+++ b/Documentation/x86/pti.txt -@@ -0,0 +1,186 @@ -+Overview -+======== -+ -+Page Table Isolation (pti, previously known as KAISER[1]) is a -+countermeasure against attacks on the shared user/kernel address -+space such as the "Meltdown" approach[2]. -+ -+To mitigate this class of attacks, we create an independent set of -+page tables for use only when running userspace applications. When -+the kernel is entered via syscalls, interrupts or exceptions, the -+page tables are switched to the full "kernel" copy. When the system -+switches back to user mode, the user copy is used again. -+ -+The userspace page tables contain only a minimal amount of kernel -+data: only what is needed to enter/exit the kernel such as the -+entry/exit functions themselves and the interrupt descriptor table -+(IDT). There are a few strictly unnecessary things that get mapped -+such as the first C function when entering an interrupt (see -+comments in pti.c). -+ -+This approach helps to ensure that side-channel attacks leveraging -+the paging structures do not function when PTI is enabled. It can be -+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. -+Once enabled at compile-time, it can be disabled at boot with the -+'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). -+ -+Page Table Management -+===================== -+ -+When PTI is enabled, the kernel manages two sets of page tables. -+The first set is very similar to the single set which is present in -+kernels without PTI. This includes a complete mapping of userspace -+that the kernel can use for things like copy_to_user(). -+ -+Although _complete_, the user portion of the kernel page tables is -+crippled by setting the NX bit in the top level. This ensures -+that any missed kernel->user CR3 switch will immediately crash -+userspace upon executing its first instruction. -+ -+The userspace page tables map only the kernel data needed to enter -+and exit the kernel. This data is entirely contained in the 'struct -+cpu_entry_area' structure which is placed in the fixmap which gives -+each CPU's copy of the area a compile-time-fixed virtual address. -+ -+For new userspace mappings, the kernel makes the entries in its -+page tables like normal. The only difference is when the kernel -+makes entries in the top (PGD) level. In addition to setting the -+entry in the main kernel PGD, a copy of the entry is made in the -+userspace page tables' PGD. -+ -+This sharing at the PGD level also inherently shares all the lower -+layers of the page tables. This leaves a single, shared set of -+userspace page tables to manage. One PTE to lock, one set of -+accessed bits, dirty bits, etc... -+ -+Overhead -+======== -+ -+Protection against side-channel attacks is important. But, -+this protection comes at a cost: -+ -+1. Increased Memory Use -+ a. Each process now needs an order-1 PGD instead of order-0. -+ (Consumes an additional 4k per process). -+ b. The 'cpu_entry_area' structure must be 2MB in size and 2MB -+ aligned so that it can be mapped by setting a single PMD -+ entry. This consumes nearly 2MB of RAM once the kernel -+ is decompressed, but no space in the kernel image itself. -+ -+2. Runtime Cost -+ a. CR3 manipulation to switch between the page table copies -+ must be done at interrupt, syscall, and exception entry -+ and exit (it can be skipped when the kernel is interrupted, -+ though.) Moves to CR3 are on the order of a hundred -+ cycles, and are required at every entry and exit. -+ b. A "trampoline" must be used for SYSCALL entry. This -+ trampoline depends on a smaller set of resources than the -+ non-PTI SYSCALL entry code, so requires mapping fewer -+ things into the userspace page tables. The downside is -+ that stacks must be switched at entry time. -+ d. Global pages are disabled for all kernel structures not -+ mapped into both kernel and userspace page tables. This -+ feature of the MMU allows different processes to share TLB -+ entries mapping the kernel. Losing the feature means more -+ TLB misses after a context switch. The actual loss of -+ performance is very small, however, never exceeding 1%. -+ d. Process Context IDentifiers (PCID) is a CPU feature that -+ allows us to skip flushing the entire TLB when switching page -+ tables by setting a special bit in CR3 when the page tables -+ are changed. This makes switching the page tables (at context -+ switch, or kernel entry/exit) cheaper. But, on systems with -+ PCID support, the context switch code must flush both the user -+ and kernel entries out of the TLB. The user PCID TLB flush is -+ deferred until the exit to userspace, minimizing the cost. -+ See intel.com/sdm for the gory PCID/INVPCID details. -+ e. The userspace page tables must be populated for each new -+ process. Even without PTI, the shared kernel mappings -+ are created by copying top-level (PGD) entries into each -+ new process. But, with PTI, there are now *two* kernel -+ mappings: one in the kernel page tables that maps everything -+ and one for the entry/exit structures. At fork(), we need to -+ copy both. -+ f. In addition to the fork()-time copying, there must also -+ be an update to the userspace PGD any time a set_pgd() is done -+ on a PGD used to map userspace. This ensures that the kernel -+ and userspace copies always map the same userspace -+ memory. -+ g. On systems without PCID support, each CR3 write flushes -+ the entire TLB. That means that each syscall, interrupt -+ or exception flushes the TLB. -+ h. INVPCID is a TLB-flushing instruction which allows flushing -+ of TLB entries for non-current PCIDs. Some systems support -+ PCIDs, but do not support INVPCID. On these systems, addresses -+ can only be flushed from the TLB for the current PCID. When -+ flushing a kernel address, we need to flush all PCIDs, so a -+ single kernel address flush will require a TLB-flushing CR3 -+ write upon the next use of every PCID. -+ -+Possible Future Work -+==================== -+1. We can be more careful about not actually writing to CR3 -+ unless its value is actually changed. -+2. Allow PTI to be enabled/disabled at runtime in addition to the -+ boot-time switching. -+ -+Testing -+======== -+ -+To test stability of PTI, the following test procedure is recommended, -+ideally doing all of these in parallel: -+ -+1. Set CONFIG_DEBUG_ENTRY=y -+2. Run several copies of all of the tools/testing/selftests/x86/ tests -+ (excluding MPX and protection_keys) in a loop on multiple CPUs for -+ several minutes. These tests frequently uncover corner cases in the -+ kernel entry code. In general, old kernels might cause these tests -+ themselves to crash, but they should never crash the kernel. -+3. Run the 'perf' tool in a mode (top or record) that generates many -+ frequent performance monitoring non-maskable interrupts (see "NMI" -+ in /proc/interrupts). This exercises the NMI entry/exit code which -+ is known to trigger bugs in code paths that did not expect to be -+ interrupted, including nested NMIs. Using "-c" boosts the rate of -+ NMIs, and using two -c with separate counters encourages nested NMIs -+ and less deterministic behavior. -+ -+ while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done -+ -+4. Launch a KVM virtual machine. -+5. Run 32-bit binaries on systems supporting the SYSCALL instruction. -+ This has been a lightly-tested code path and needs extra scrutiny. -+ -+Debugging -+========= -+ -+Bugs in PTI cause a few different signatures of crashes -+that are worth noting here. -+ -+ * Failures of the selftests/x86 code. Usually a bug in one of the -+ more obscure corners of entry_64.S -+ * Crashes in early boot, especially around CPU bringup. Bugs -+ in the trampoline code or mappings cause these. -+ * Crashes at the first interrupt. Caused by bugs in entry_64.S, -+ like screwing up a page table switch. Also caused by -+ incorrectly mapping the IRQ handler entry code. -+ * Crashes at the first NMI. The NMI code is separate from main -+ interrupt handlers and can have bugs that do not affect -+ normal interrupts. Also caused by incorrectly mapping NMI -+ code. NMIs that interrupt the entry code must be very -+ careful and can be the cause of crashes that show up when -+ running perf. -+ * Kernel crashes at the first exit to userspace. entry_64.S -+ bugs, or failing to map some of the exit code. -+ * Crashes at first interrupt that interrupts userspace. The paths -+ in entry_64.S that return to userspace are sometimes separate -+ from the ones that return to the kernel. -+ * Double faults: overflowing the kernel stack because of page -+ faults upon page faults. Caused by touching non-pti-mapped -+ data in the entry code, or forgetting to switch to kernel -+ CR3 before calling into C functions which are not pti-mapped. -+ * Userspace segfaults early in boot, sometimes manifesting -+ as mount(8) failing to mount the rootfs. These have -+ tended to be TLB invalidation issues. Usually invalidating -+ the wrong PCID, or otherwise missing an invalidation. -+ -+1. https://gruss.cc/files/kaiser.pdf -+2. https://meltdownattack.com/meltdown.pdf --- -2.7.4 - |