aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/include/asm/msr-index.h
AgeCommit message (Collapse)Author
2020-10-14Merge tag 'x86_seves_for_v5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV-ES support from Borislav Petkov: "SEV-ES enhances the current guest memory encryption support called SEV by also encrypting the guest register state, making the registers inaccessible to the hypervisor by en-/decrypting them on world switches. Thus, it adds additional protection to Linux guests against exfiltration, control flow and rollback attacks. With SEV-ES, the guest is in full control of what registers the hypervisor can access. This is provided by a guest-host exchange mechanism based on a new exception vector called VMM Communication Exception (#VC), a new instruction called VMGEXIT and a shared Guest-Host Communication Block which is a decrypted page shared between the guest and the hypervisor. Intercepts to the hypervisor become #VC exceptions in an SEV-ES guest so in order for that exception mechanism to work, the early x86 init code needed to be made able to handle exceptions, which, in itself, brings a bunch of very nice cleanups and improvements to the early boot code like an early page fault handler, allowing for on-demand building of the identity mapping. With that, !KASLR configurations do not use the EFI page table anymore but switch to a kernel-controlled one. The main part of this series adds the support for that new exchange mechanism. The goal has been to keep this as much as possibly separate from the core x86 code by concentrating the machinery in two SEV-ES-specific files: arch/x86/kernel/sev-es-shared.c arch/x86/kernel/sev-es.c Other interaction with core x86 code has been kept at minimum and behind static keys to minimize the performance impact on !SEV-ES setups. Work by Joerg Roedel and Thomas Lendacky and others" * tag 'x86_seves_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits) x86/sev-es: Use GHCB accessor for setting the MMIO scratch buffer x86/sev-es: Check required CPU features for SEV-ES x86/efi: Add GHCB mappings when SEV-ES is active x86/sev-es: Handle NMI State x86/sev-es: Support CPU offline/online x86/head/64: Don't call verify_cpu() on starting APs x86/smpboot: Load TSS and getcpu GDT entry before loading IDT x86/realmode: Setup AP jump table x86/realmode: Add SEV-ES specific trampoline entry point x86/vmware: Add VMware-specific handling for VMMCALL under SEV-ES x86/kvm: Add KVM-specific VMMCALL handling under SEV-ES x86/paravirt: Allow hypervisor-specific VMMCALL handling under SEV-ES x86/sev-es: Handle #DB Events x86/sev-es: Handle #AC Events x86/sev-es: Handle VMMCALL Events x86/sev-es: Handle MWAIT/MWAITX Events x86/sev-es: Handle MONITOR/MONITORX Events x86/sev-es: Handle INVD Events x86/sev-es: Handle RDPMC Events x86/sev-es: Handle RDTSC(P) Events ...
2020-10-12Merge tag 'perf-core-2020-10-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "x86 Intel updates: - Add Jasper Lake support - Add support for TopDown metrics on Ice Lake - Fix Ice Lake & Tiger Lake uncore support, add Snow Ridge support - Add a PCI sub driver to support uncore PMUs where the PCI resources have been claimed already - extending the range of supported systems. x86 AMD updates: - Restore 'perf stat -a' behaviour to program the uncore PMU to count all CPU threads. - Fix setting the proper count when sampling Large Increment per Cycle events / 'paired' events. - Fix IBS Fetch sampling on F17h and some other IBS fine tuning, greatly reducing the number of interrupts when large sample periods are specified. - Extends Family 17h RAPL support to also work on compatible F19h machines. Core code updates: - Fix race in perf_mmap_close() - Add PERF_EV_CAP_SIBLING, to denote that sibling events should be closed if the leader is removed. - Smaller fixes and updates" * tag 'perf-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) perf/core: Fix race in the perf_mmap_close() function perf/x86: Fix n_metric for cancelled txn perf/x86: Fix n_pair for cancelled txn x86/events/amd/iommu: Fix sizeof mismatch perf/x86/intel: Check perf metrics feature for each CPU perf/x86/intel: Fix Ice Lake event constraint table perf/x86/intel/uncore: Fix the scale of the IMC free-running events perf/x86/intel/uncore: Fix for iio mapping on Skylake Server perf/x86/msr: Add Jasper Lake support perf/x86/intel: Add Jasper Lake support perf/x86/intel/uncore: Reduce the number of CBOX counters perf/x86/intel/uncore: Update Ice Lake uncore units perf/x86/intel/uncore: Split the Ice Lake and Tiger Lake MSR uncore support perf/x86/intel/uncore: Support PCIe3 unit on Snow Ridge perf/x86/intel/uncore: Generic support for the PCI sub driver perf/x86/intel/uncore: Factor out uncore_pci_pmu_unregister() perf/x86/intel/uncore: Factor out uncore_pci_pmu_register() perf/x86/intel/uncore: Factor out uncore_pci_find_dev_pmu() perf/x86/intel/uncore: Factor out uncore_pci_get_dev_die_info() perf/amd/uncore: Inform the user how many counters each uncore PMU has ...
2020-09-17x86/msr-index: Define an IA32_PASID MSRFenghua Yu
The IA32_PASID MSR (0xd93) contains the Process Address Space Identifier (PASID), a 20-bit value. Bit 31 must be set to indicate the value programmed in the MSR is valid. Hardware uses the PASID to identify a process address space and direct responses to the right address space. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/1600187413-163670-7-git-send-email-fenghua.yu@intel.com
2020-09-10perf/x86/amd/ibs: Fix raw sample data accumulationKim Phillips
Neither IbsBrTarget nor OPDATA4 are populated in IBS Fetch mode. Don't accumulate them into raw sample user data in that case. Also, in Fetch mode, add saving the IBS Fetch Control Extended MSR. Technically, there is an ABI change here with respect to the IBS raw sample data format, but I don't see any perf driver version information being included in perf.data file headers, but, existing users can detect whether the size of the sample record has reduced by 8 bytes to determine whether the IBS driver has this fix. Fixes: 904cb3677f3a ("perf/x86/amd/ibs: Update IBS MSRs and feature definitions") Reported-by: Stephane Eranian <stephane.eranian@google.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200908214740.18097-6-kim.phillips@amd.com
2020-09-07x86/sev-es: Add SEV-ES Feature DetectionJoerg Roedel
Add a sev_es_active() function for checking whether SEV-ES is enabled. Also cache the value of MSR_AMD64_SEV at boot to speed up the feature checking in the running code. [ bp: Remove "!!" in sev_active() too. ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200907131613.12703-37-joro@8bytes.org
2020-09-07x86/boot/compressed/64: Add stage1 #VC handlerJoerg Roedel
Add the first handler for #VC exceptions. At stage 1 there is no GHCB yet because the kernel might still be running on the EFI page table. The stage 1 handler is limited to the MSR-based protocol to talk to the hypervisor and can only support CPUID exit-codes, but that is enough to get to stage 2. [ bp: Zap superfluous newlines after rd/wrmsr instruction mnemonics. ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200907131613.12703-20-joro@8bytes.org
2020-08-18perf/x86/intel: Support TopDown metrics on Ice LakeKan Liang
Ice Lake supports the hardware TopDown metrics feature, which can free up the scarce GP counters. Update the event constraints for the metrics events. The metric counters do not exist, which are mapped to a dummy offset. The sharing between multiple users of the same metric without multiplexing is not allowed. Implement set_topdown_event_period for Ice Lake. The values in PERF_METRICS MSR are derived from the fixed counter 3. Both registers should start from zero. Implement update_topdown_event for Ice Lake. The metric is reported by multiplying the metric (fraction) with slots. To maintain accurate measurements, both registers are cleared for each update. The fixed counter 3 should always be cleared before the PERF_METRICS. Implement td_attr for the new metrics events and the new slots fixed counter. Make them visible to the perf user tools. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-11-kan.liang@linux.intel.com
2020-08-18perf/x86/intel: Generic support for hardware TopDown metricsKan Liang
Intro ===== The TopDown Microarchitecture Analysis (TMA) Method is a structured analysis methodology to identify critical performance bottlenecks in out-of-order processors. Current perf has supported the method. The method works well, but there is one problem. To collect the TopDown events, several GP counters have to be used. If a user wants to collect other events at the same time, the multiplexing probably be triggered, which impacts the accuracy. To free up the scarce GP counters, the hardware TopDown metrics feature is introduced from Ice Lake. The hardware implements an additional "metrics" register and a new Fixed Counter 3 that measures pipeline "slots". The TopDown events can be calculated from them instead. Events ====== The level 1 TopDown has four metrics. There is no event-code assigned to the TopDown metrics. Four metric events are exported as separate perf events, which map to the internal "metrics" counter register. Those events do not exist in hardware, but can be allocated by the scheduler. For the event mapping, a special 0x00 event code is used, which is reserved for fake events. The metric events start from umask 0x10. When setting up the metric events, they point to the Fixed Counter 3. They have to be specially handled. - Add the update_topdown_event() callback to read the additional metrics MSR and generate the metrics. - Add the set_topdown_event_period() callback to initialize metrics MSR and the fixed counter 3. - Add a variable n_metric_event to track the number of the accepted metrics events. The sharing between multiple users of the same metric without multiplexing is not allowed. - Only enable/disable the fixed counter 3 when there are no other active TopDown events, which avoid the unnecessary writing of the fixed control register. - Disable the PMU when reading the metrics event. The metrics MSR and the fixed counter 3 are read separately. The values may be modified by an NMI. All four metric events don't support sampling. Since they will be handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is introduced to indicate this case. The slots event can support both sampling and counting. For counting, the flag is also applied. For sampling, it will be handled normally as other normal events. Groups ====== The slots event is required in a Topdown group. To avoid reading the METRICS register multiple times, the metrics and slots value can only be updated by slots event in a group. All active slots and metrics events will be updated one time. Therefore, the slots event must be before any metric events in a Topdown group. NMI ====== The METRICS related register may be overflow. The bit 48 of the STATUS register will be set. If so, PERF_METRICS and Fixed counter 3 are required to be reset. The patch also update all active slots and metrics events in the NMI handler. The update_topdown_event() has to read two registers separately. The values may be modified by an NMI. PMU has to be disabled before calling the function. RDPMC ====== RDPMC is temporarily disabled. A later patch will enable it. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-9-kan.liang@linux.intel.com
2020-08-03Merge tag 'pm-5.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "The most significant change here is the extension of the Energy Model to cover non-CPU devices (as well as CPUs) from Lukasz Luba. There is also some new hardware support (Ice Lake server idle states table for intel_idle, Sapphire Rapids and Power Limit 4 support in the RAPL driver), some new functionality in the existing drivers (eg. a new switch to disable/enable CPU energy-efficiency optimizations in intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci, devfreq), including the elimination of W=1 build warnings from cpufreq done by Lee Jones. Specifics: - Make the Energy Model cover non-CPU devices (Lukasz Luba). - Add Ice Lake server idle states table to the intel_idle driver and eliminate a redundant static variable from it (Chen Yu, Rafael Wysocki). - Eliminate all W=1 build warnings from cpufreq (Lee Jones). - Add support for Sapphire Rapids and for Power Limit 4 to the Intel RAPL power capping driver (Sumeet Pawnikar, Zhang Rui). - Fix function name in kerneldoc comments in the idle_inject power capping driver (Yangtao Li). - Fix locking issues with cpufreq governors and drop a redundant "weak" function definition from cpufreq (Viresh Kumar). - Rearrange cpufreq to register non-modular governors at the core_initcall level and allow the default cpufreq governor to be specified in the kernel command line (Quentin Perret). - Extend, fix and clean up the intel_pstate driver (Srinivas Pandruvada, Rafael Wysocki): * Add a new sysfs attribute for disabling/enabling CPU energy-efficiency optimizations in the processor. * Make the driver avoid enabling HWP if EPP is not supported. * Allow the driver to handle numeric EPP values in the sysfs interface and fix the setting of EPP via sysfs in the active mode. * Eliminate a static checker warning and clean up a kerneldoc comment. - Clean up some variable declarations in the powernv cpufreq driver (Wei Yongjun). - Fix up the ->enter_s2idle callback definition to cover the case when it points to the same function as ->idle correctly (Neal Liu). - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson). - Make the PM core emit "changed" uevent when adding/removing the "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi). - Add a helper macro for declaring PM callbacks and use it in the MMC jz4740 driver (Paul Cercueil). - Fix white space in some places in the hibernate code and make the system-wide PM code use "const char *" where appropriate (Xiang Chen, Alexey Dobriyan). - Add one more "unsafe" helper macro to the freezer to cover the NFS use case (He Zhe). - Change the language in the generic PM domains framework to use parent/child terminology and clean up a typo and some comment fromatting in that code (Kees Cook, Geert Uytterhoeven). - Update the operating performance points OPP framework (Lukasz Luba, Andrew-sh.Cheng, Valdis Kletnieks): * Refactor dev_pm_opp_of_register_em() and update related drivers. * Add a missing function export. * Allow disabled OPPs in dev_pm_opp_get_freq(). - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier): * Add support for delayed timers to the devfreq core and make the Samsung exynos5422-dmc driver use it. * Unify sysfs interface to use "df-" as a prefix in instance names consistently. * Fix devfreq_summary debugfs node indentation. * Add the rockchip,pmu phandle to the rk3399_dmc driver DT bindings. * List Dmitry Osipenko as the Tegra devfreq driver maintainer. * Fix typos in the core devfreq code. - Update the pm-graph utility to version 5.7 including a number of fixes related to suspend-to-idle (Todd Brandt). - Fix coccicheck errors and warnings in the cpupower utility (Shuah Khan). - Replace HTTP links with HTTPs ones in multiple places (Alexander A. Klimov)" * tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits) cpuidle: ACPI: fix 'return' with no value build warning cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode cpufreq: intel_pstate: Rearrange the storing of new EPP values intel_idle: Customize IceLake server support PM / devfreq: Fix the wrong end with semicolon PM / devfreq: Fix indentaion of devfreq_summary debugfs node PM / devfreq: Clean up the devfreq instance name in sysfs attr memory: samsung: exynos5422-dmc: Add module param to control IRQ mode memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold memory: samsung: exynos5422-dmc: Use delayed timer as default PM / devfreq: Add support delayed timer for polling mode dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle PM / devfreq: tegra: Add Dmitry as a maintainer PM / devfreq: event: Fix trivial spelling PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent cpuidle: change enter_s2idle() prototype cpuidle: psci: Prevent domain idlestates until consumers are ready cpuidle: psci: Convert PM domain to platform driver cpuidle: psci: Fix error path via converting to a platform driver cpuidle: psci: Fail cpuidle registration if set OSI mode failed ...
2020-08-03Merge tag 'x86-cleanups-2020-08-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups all around the place" * tag 'x86-cleanups-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioperm: Initialize pointer bitmap with NULL rather than 0 x86: uv: uv_hub.h: Delete duplicated word x86: cmpxchg_32.h: Delete duplicated word x86: bootparam.h: Delete duplicated word x86/mm: Remove the unused mk_kernel_pgd() #define x86/tsc: Remove unused "US_SCALE" and "NS_SCALE" leftover macros x86/ioapic: Remove unused "IOAPIC_AUTO" define x86/mm: Drop unused MAX_PHYSADDR_BITS x86/msr: Move the F15h MSRs where they belong x86/idt: Make idt_descr static initrd: Remove erroneous comment x86/mm/32: Fix -Wmissing prototypes warnings for init.c cpu/speculation: Add prototype for cpu_show_srbds() x86/mm: Fix -Wmissing-prototypes warnings for arch/x86/mm/init.c x86/asm: Unify __ASSEMBLY__ blocks x86/cpufeatures: Mark two free bits in word 3 x86/msr: Lift AMD family 0x15 power-specific MSRs
2020-07-08x86/msr-index: Add bunch of MSRs for Arch LBRKan Liang
Add Arch LBR related MSRs and the new LBR INFO bits in MSR-index. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1593780569-62993-8-git-send-email-kan.liang@linux.intel.com
2020-07-02cpufreq: intel_pstate: Allow enable/disable energy efficiencySrinivas Pandruvada
By default intel_pstate the driver disables energy efficiency by setting MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode. This CPU model is also shared by Coffee Lake desktop CPUs. This allows these systems to reach maximum possible frequency. But this adds power penalty, which some customers don't want. They want some way to enable/ disable dynamically. So, add an additional attribute "energy_efficiency" under /sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows to read and write bit 19 ("Disable Energy Efficiency Optimization") in the MSR IA32_POWER_CTL. This attribute is present in both HWP and non-HWP mode as this has an effect in both modes. Refer to Intel Software Developer's manual for details. The scope of this bit is package wide. Also these systems are single package systems. So read/write MSR on the current CPU is enough. The energy efficiency (EE) bit setting needs to be preserved during suspend/resume and CPU offline/online operation. To do this: - Restoring the EE setting from the cpufreq resume() callback, if there is change from the system default. - By default, don't disable EE from cpufreq init() callback for matching CPU models. Since the scope is package wide and is a single package system, move the disable EE calls from init() callback to intel_pstate_init() function, which is called only once. Suggested-by: Len Brown <lenb@kernel.org> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-22x86/msr: Move the F15h MSRs where they belongBorislav Petkov
1068ed4547ad ("x86/msr: Lift AMD family 0x15 power-specific MSRs") moved the three F15h power MSRs to the architectural list but that was wrong as they belong in the family 0x15 list. That also caused: In file included from trace/beauty/tracepoints/x86_msr.c:10: perf/trace/beauty/generated/x86_arch_MSRs_array.c:292:45: error: initialized field overwritten [-Werror=override-init] 292 | [0xc0010280 - x86_AMD_V_KVM_MSRs_offset] = "F15H_PTSC", | ^~~~~~~~~~~ perf/trace/beauty/generated/x86_arch_MSRs_array.c:292:45: note: (near initialization for 'x86_AMD_V_KVM_MSRs[640]') due to MSR_F15H_PTSC ending up being defined twice. Move them where they belong and drop the duplicate. Also, drop the respective tools/ changes of the msr-index.h copy the above commit added because perf tool developers prefer to go through those changes themselves in order to figure out whether changes to the kernel headers would need additional handling in perf. Fixes: 1068ed4547ad ("x86/msr: Lift AMD family 0x15 power-specific MSRs") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lkml.kernel.org/r/20200621163323.14e8533f@canb.auug.org.au
2020-06-15x86/msr: Lift AMD family 0x15 power-specific MSRsBorislav Petkov
... into the global msr-index.h header because they're used in multiple compilation units. Sort the MSR list a bit. Update the msr-index.h copy in tools. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Guenter Roeck <linux@roeck-us.net> Link: https://lkml.kernel.org/r/20200608164847.14232-1-bp@alien8.de
2020-06-09Merge branch 'x86/srbds' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 srbds fixes from Thomas Gleixner: "The 9th episode of the dime novel "The performance killer" with the subtitle "Slow Randomizing Boosts Denial of Service". SRBDS is an MDS-like speculative side channel that can leak bits from the random number generator (RNG) across cores and threads. New microcode serializes the processor access during the execution of RDRAND and RDSEED. This ensures that the shared buffer is overwritten before it is released for reuse. This is equivalent to a full bus lock, which means that many threads running the RNG instructions in parallel have the same effect as the same amount of threads issuing a locked instruction targeting an address which requires locking of two cachelines at once. The mitigation support comes with the usual pile of unpleasant ingredients: - command line options - sysfs file - microcode checks - a list of vulnerable CPUs identified by model and stepping this time which requires stepping match support for the cpu match logic. - the inevitable slowdown of affected CPUs" * branch 'x86/srbds' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Add Ivy Bridge to affected list x86/speculation: Add SRBDS vulnerability and mitigation documentation x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigation x86/cpu: Add 'table' argument to cpu_matches()
2020-05-28perf/x86/rapl: Add AMD Fam17h RAPL supportStephane Eranian
This patch enables AMD Fam17h RAPL support for the Package level metric. The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh (Zen1) PPR. The same output is available via the energy-pkg pseudo event: $ perf stat -a -I 1000 --per-socket -e power/energy-pkg/ Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200527224659.206129-6-eranian@google.com
2020-04-20x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigationMark Gross
SRBDS is an MDS-like speculative side channel that can leak bits from the random number generator (RNG) across cores and threads. New microcode serializes the processor access during the execution of RDRAND and RDSEED. This ensures that the shared buffer is overwritten before it is released for reuse. While it is present on all affected CPU models, the microcode mitigation is not needed on models that enumerate ARCH_CAPABILITIES[MDS_NO] in the cases where TSX is not supported or has been disabled with TSX_CTRL. The mitigation is activated by default on affected processors and it increases latency for RDRAND and RDSEED instructions. Among other effects this will reduce throughput from /dev/urandom. * Enable administrator to configure the mitigation off when desired using either mitigations=off or srbds=off. * Export vulnerability status via sysfs * Rename file-scoped macros to apply for non-whitelist table initializations. [ bp: Massage, - s/VULNBL_INTEL_STEPPING/VULNBL_INTEL_STEPPINGS/g, - do not read arch cap MSR a second time in tsx_fused_off() - just pass it in, - flip check in cpu_set_bug_bits() to save an indentation level, - reflow comments. jpoimboe: s/Mitigated/Mitigation/ in user-visible strings tglx: Dropped the fused off magic for now ] Signed-off-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
2020-03-30Merge tag 'x86-splitlock-2020-03-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 splitlock updates from Thomas Gleixner: "Support for 'split lock' detection: Atomic operations (lock prefixed instructions) which span two cache lines have to acquire the global bus lock. This is at least 1k cycles slower than an atomic operation within a cache line and disrupts performance on other cores. Aside of performance disruption this is a unpriviledged form of DoS. Some newer CPUs have the capability to raise an #AC trap when such an operation is attempted. The detection is by default enabled in warning mode which will warn once when a user space application is caught. A command line option allows to disable the detection or to select fatal mode which will terminate offending applications with SIGBUS" * tag 'x86-splitlock-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/split_lock: Avoid runtime reads of the TEST_CTRL MSR x86/split_lock: Rework the initialization flow of split lock detection x86/split_lock: Enable split lock detection by kernel
2020-02-20x86/split_lock: Enable split lock detection by kernelPeter Zijlstra (Intel)
A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
2020-02-19x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERFKim Phillips
Commit aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter") added support for access to the free-running counter via 'perf -e msr/irperf/', but when exercised, it always returns a 0 count: BEFORE: $ perf stat -e instructions,msr/irperf/ true Performance counter stats for 'true': 624,833 instructions 0 msr/irperf/ Simply set its enable bit - HWCR bit 30 - to make it start counting. Enablement is restricted to all machines advertising IRPERF capability, except those susceptible to an erratum that makes the IRPERF return bad values. That erratum occurs in Family 17h models 00-1fh [1], but not in F17h models 20h and above [2]. AFTER (on a family 17h model 31h machine): $ perf stat -e instructions,msr/irperf/ true Performance counter stats for 'true': 621,690 instructions 622,490 msr/irperf/ [1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors [2] Revision Guide for AMD Family 17h Models 30h-3Fh Processors The revision guides are available from the bugzilla Link below. [ bp: Massage commit message. ] Fixes: aaf248848db50 ("perf/x86/msr: Add AMD IRPERF (Instructions Retired) performance counter") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Link: http://lkml.kernel.org/r/20200214201805.13830-1-kim.phillips@amd.com
2020-01-13x86/msr-index: Clean up bit defines for IA32_FEATURE_CONTROL MSRSean Christopherson
As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL are quite a mouthful, especially the VMX bits which must differentiate between enabling VMX inside and outside SMX (TXT) operation. Rename the MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to make them a little friendlier on the eyes. Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name to match Intel's SDM, but a future patch will add a dedicated Kconfig, file and functions for the MSR. Using the full name for those assets is rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its nomenclature is consistent throughout the kernel. Opportunistically, fix a few other annoyances with the defines: - Relocate the bit defines so that they immediately follow the MSR define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL. - Add whitespace around the block of feature control defines to make it clear they're all related. - Use BIT() instead of manually encoding the bit shift. - Use "VMX" instead of "VMXON" to match the SDM. - Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to be consistent with the kernel's verbiage used for all other feature control bits. Note, the SDM refers to the LMCE bit as LMCE_ON, likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN. Ignore the (literal) one-off usage of _ON, the SDM is simply "wrong". Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
2019-11-25Merge tag 'for-linus-5.5a-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - a small series to remove the build constraint of Xen x86 MCE handling to 64-bit only - a bunch of minor cleanups * tag 'for-linus-5.5a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: Fix Kconfig indentation xen/mcelog: also allow building for 32-bit kernels xen/mcelog: add PPIN to record when available xen/mcelog: drop __MC_MSR_MCGCAP xen/gntdev: Use select for DMA_SHARED_BUFFER xen: mm: make xen_mm_init static xen: mm: include <xen/xen-ops.h> for missing declarations
2019-11-14xen/mcelog: add PPIN to record when availableJan Beulich
This is to augment commit 3f5a7896a5 ("x86/mce: Include the PPIN in MCE records when available"). I'm also adding "synd" and "ipid" fields to struct xen_mce, in an attempt to keep field offsets in sync with struct mce. These two fields won't get populated for now, though. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-11-04x86/bugs: Add ITLB_MULTIHIT bug infrastructureVineela Tummalapalli
Some processors may incur a machine check error possibly resulting in an unrecoverable CPU lockup when an instruction fetch encounters a TLB multi-hit in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. The relevant erratum can be found here: https://bugzilla.kernel.org/show_bug.cgi?id=205195 There are other processors affected for which the erratum does not fully disclose the impact. This issue affects both bare-metal x86 page tables and EPT. It can be mitigated by either eliminating the use of large pages or by using careful TLB invalidations when changing the page size in the page tables. Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which are mitigated against this issue. Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-10-28x86/speculation/taa: Add mitigation for TSX Async AbortPawan Gupta
TSX Async Abort (TAA) is a side channel vulnerability to the internal buffers in some Intel processors similar to Microachitectural Data Sampling (MDS). In this case, certain loads may speculatively pass invalid data to dependent operations when an asynchronous abort condition is pending in a TSX transaction. This includes loads with no fault or assist condition. Such loads may speculatively expose stale data from the uarch data structures as in MDS. Scope of exposure is within the same-thread and cross-thread. This issue affects all current processors that support TSX, but do not have ARCH_CAP_TAA_NO (bit 8) set in MSR_IA32_ARCH_CAPABILITIES. On CPUs which have their IA32_ARCH_CAPABILITIES MSR bit MDS_NO=0, CPUID.MD_CLEAR=1 and the MDS mitigation is clearing the CPU buffers using VERW or L1D_FLUSH, there is no additional mitigation needed for TAA. On affected CPUs with MDS_NO=1 this issue can be mitigated by disabling the Transactional Synchronization Extensions (TSX) feature. A new MSR IA32_TSX_CTRL in future and current processors after a microcode update can be used to control the TSX feature. There are two bits in that MSR: * TSX_CTRL_RTM_DISABLE disables the TSX sub-feature Restricted Transactional Memory (RTM). * TSX_CTRL_CPUID_CLEAR clears the RTM enumeration in CPUID. The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled with updated microcode but still enumerated as present by CPUID(EAX=7).EBX{bit4}. The second mitigation approach is similar to MDS which is clearing the affected CPU buffers on return to user space and when entering a guest. Relevant microcode update is required for the mitigation to work. More details on this approach can be found here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html The TSX feature can be controlled by the "tsx" command line parameter. If it is force-enabled then "Clear CPU buffers" (MDS mitigation) is deployed. The effective mitigation state can be read from sysfs. [ bp: - massage + comments cleanup - s/TAA_MITIGATION_TSX_DISABLE/TAA_MITIGATION_TSX_DISABLED/g - Josh. - remove partial TAA mitigation in update_mds_branch_idle() - Josh. - s/tsx_async_abort_cmdline/tsx_async_abort_parse_cmdline/g ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
2019-10-28x86/msr: Add the IA32_TSX_CTRL MSRPawan Gupta
Transactional Synchronization Extensions (TSX) may be used on certain processors as part of a speculative side channel attack. A microcode update for existing processors that are vulnerable to this attack will add a new MSR - IA32_TSX_CTRL to allow the system administrator the option to disable TSX as one of the possible mitigations. The CPUs which get this new MSR after a microcode upgrade are the ones which do not set MSR_IA32_ARCH_CAPABILITIES.MDS_NO (bit 5) because those CPUs have CPUID.MD_CLEAR, i.e., the VERW implementation which clears all CPU buffers takes care of the TAA case as well. [ Note that future processors that are not vulnerable will also support the IA32_TSX_CTRL MSR. ] Add defines for the new IA32_TSX_CTRL MSR and its bits. TSX has two sub-features: 1. Restricted Transactional Memory (RTM) is an explicitly-used feature where new instructions begin and end TSX transactions. 2. Hardware Lock Elision (HLE) is implicitly used when certain kinds of "old" style locks are used by software. Bit 7 of the IA32_ARCH_CAPABILITIES indicates the presence of the IA32_TSX_CTRL MSR. There are two control bits in IA32_TSX_CTRL MSR: Bit 0: When set, it disables the Restricted Transactional Memory (RTM) sub-feature of TSX (will force all transactions to abort on the XBEGIN instruction). Bit 1: When set, it disables the enumeration of the RTM and HLE feature (i.e. it will make CPUID(EAX=7).EBX{bit4} and CPUID(EAX=7).EBX{bit11} read as 0). The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled by the new microcode but still enumerated as present by CPUID(EAX=7).EBX{bit4}, unless disabled by IA32_TSX_CTRL_MSR[1] - TSX_CTRL_CPUID_CLEAR. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Mark Gross <mgross@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
2019-09-16Merge branch 'x86-cpu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu-feature updates from Ingo Molnar: - Rework the Intel model names symbols/macros, which were decades of ad-hoc extensions and added random noise. It's now a coherent, easy to follow nomenclature. - Add new Intel CPU model IDs: - "Tiger Lake" desktop and mobile models - "Elkhart Lake" model ID - and the "Lightning Mountain" variant of Airmont, plus support code - Add the new AVX512_VP2INTERSECT instruction to cpufeatures - Remove Intel MPX user-visible APIs and the self-tests, because the toolchain (gcc) is not supporting it going forward. This is the first, lowest-risk phase of MPX removal. - Remove X86_FEATURE_MFENCE_RDTSC - Various smaller cleanups and fixes * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) x86/cpu: Update init data for new Airmont CPU model x86/cpu: Add new Airmont variant to Intel family x86/cpu: Add Elkhart Lake to Intel family x86/cpu: Add Tiger Lake to Intel family x86: Correct misc typos x86/intel: Add common OPTDIFFs x86/intel: Aggregate microserver naming x86/intel: Aggregate big core graphics naming x86/intel: Aggregate big core mobile naming x86/intel: Aggregate big core client naming x86/cpufeature: Explain the macro duplication x86/ftrace: Remove mcount() declaration x86/PCI: Remove superfluous returns from void functions x86/msr-index: Move AMD MSRs where they belong x86/cpu: Use constant definitions for CPU models lib: Remove redundant ftrace flag removal x86/crash: Remove unnecessary comparison x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE() x86: Remove X86_FEATURE_MFENCE_RDTSC x86/mpx: Remove MPX APIs ...
2019-08-28perf/x86/intel: Support PEBS output to PTAlexander Shishkin
If PEBS declares ability to output its data to Intel PT stream, use the aux_output attribute bit to enable PEBS data output to PT. This requires a PT event to be present and scheduled in the same context. Unlike the DS area, the kernel does not extract PEBS records from the PT stream to generate corresponding records in the perf stream, because that would require real time in-kernel PT decoding, which is not feasible. The PMI, however, can still be used. The output setting is per-CPU, so all PEBS events must be either writing to PT or to the DS area, therefore, in case of conflict, the conflicting event will fail to schedule, allowing the rotation logic to alternate between the PEBS->PT and PEBS->DS events. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: kan.liang@linux.intel.com Link: https://lkml.kernel.org/r/20190806084606.4021-3-alexander.shishkin@linux.intel.com
2019-08-26Merge tag 'v5.3-rc6' into x86/cpu, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-19x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16hTom Lendacky
There have been reports of RDRAND issues after resuming from suspend on some AMD family 15h and family 16h systems. This issue stems from a BIOS not performing the proper steps during resume to ensure RDRAND continues to function properly. RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND support using CPUID, including the kernel, will believe that RDRAND is not supported. Update the CPU initialization to clear the RDRAND CPUID bit for any family 15h and 16h processor that supports RDRAND. If it is known that the family 15h or family 16h system does not have an RDRAND resume issue or that the system will not be placed in suspend, the "rdrand=force" kernel parameter can be used to stop the clearing of the RDRAND CPUID bit. Additionally, update the suspend and resume path to save and restore the MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in place after resuming from suspend. Note, that clearing the RDRAND CPUID bit does not prevent a processor that normally supports the RDRAND instruction from executing it. So any code that determined the support based on family and model won't #UD. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chen Yu <yu.c.chen@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org> Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
2019-08-19x86/msr-index: Move AMD MSRs where they belongBorislav Petkov
... sort them in and fixup comment, while at it. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190819070140.23708-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24x86/umwait: Initialize umwait control valuesFenghua Yu
umwait or tpause allows the processor to enter a light-weight power/performance optimized state (C0.1 state) or an improved power/performance optimized state (C0.2 state) for a period specified by the instruction or until the system time limit or until a store to the monitored address range in umwait. IA32_UMWAIT_CONTROL MSR register allows the OS to enable/disable C0.2 on the processor and to set the maximum time the processor can reside in C0.1 or C0.2. By default C0.2 is enabled so the user wait instructions can enter the C0.2 state to save more power with slower wakeup time. Andy Lutomirski proposed to set the maximum umwait time to 100000 cycles by default. A quote from Andy: "What I want to avoid is the case where it works dramatically differently on NO_HZ_FULL systems as compared to everything else. Also, UMWAIT may behave a bit differently if the max timeout is hit, and I'd like that path to get exercised widely by making it happen even on default configs." A sysfs interface to adjust the time and the C0.2 enablement is provided in a follow up change. [ tglx: Renamed MSR_IA32_UMWAIT_CONTROL_MAX_TIME to MSR_IA32_UMWAIT_CONTROL_TIME_MASK because the constant is used as mask throughout the code. Massaged comments and changelog ] Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: "Borislav Petkov" <bp@alien8.de> Cc: "H Peter Anvin" <hpa@zytor.com> Cc: "Peter Zijlstra" <peterz@infradead.org> Cc: "Tony Luck" <tony.luck@intel.com> Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com> Link: https://lkml.kernel.org/r/1560994438-235698-3-git-send-email-fenghua.yu@intel.com
2019-05-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "ARM: - support for SVE and Pointer Authentication in guests - PMU improvements POWER: - support for direct access to the POWER9 XIVE interrupt controller - memory and performance optimizations x86: - support for accessing memory not backed by struct page - fixes and refactoring Generic: - dirty page tracking improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits) kvm: fix compilation on aarch64 Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" kvm: x86: Fix L1TF mitigation for shadow MMU KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing" KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete tests: kvm: Add tests for KVM_SET_NESTED_STATE KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID tests: kvm: Add tests to .gitignore KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one KVM: Fix the bitmap range to copy during clear dirty KVM: arm64: Fix ptrauth ID register masking logic KVM: x86: use direct accessors for RIP and RSP KVM: VMX: Use accessors for GPRs outside of dedicated caching logic KVM: x86: Omit caching logic for always-available GPRs kvm, x86: Properly check whether a pfn is an MMIO or not ...
2019-05-14Merge branch 'x86-mds-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MDS mitigations from Thomas Gleixner: "Microarchitectural Data Sampling (MDS) is a hardware vulnerability which allows unprivileged speculative access to data which is available in various CPU internal buffers. This new set of misfeatures has the following CVEs assigned: CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory MDS attacks target microarchitectural buffers which speculatively forward data under certain conditions. Disclosure gadgets can expose this data via cache side channels. Contrary to other speculation based vulnerabilities the MDS vulnerability does not allow the attacker to control the memory target address. As a consequence the attacks are purely sampling based, but as demonstrated with the TLBleed attack samples can be postprocessed successfully. The mitigation is to flush the microarchitectural buffers on return to user space and before entering a VM. It's bolted on the VERW instruction and requires a microcode update. As some of the attacks exploit data structures shared between hyperthreads, full protection requires to disable hyperthreading. The kernel does not do that by default to avoid breaking unattended updates. The mitigation set comes with documentation for administrators and a deeper technical view" * 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/speculation/mds: Fix documentation typo Documentation: Correct the possible MDS sysfs values x86/mds: Add MDSUM variant to the MDS documentation x86/speculation/mds: Add 'mitigations=' support for MDS x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off x86/speculation/mds: Fix comment x86/speculation/mds: Add SMT warning message x86/speculation: Move arch_smt_update() call to after mitigation decisions x86/speculation/mds: Add mds=full,nosmt cmdline option Documentation: Add MDS vulnerability documentation Documentation: Move L1TF to separate directory x86/speculation/mds: Add mitigation mode VMWERV x86/speculation/mds: Add sysfs reporting for MDS x86/speculation/mds: Add mitigation control for MDS x86/speculation/mds: Conditionally clear CPU buffers on idle entry x86/kvm/vmx: Add MDS protection when L1D Flush is not active x86/speculation/mds: Clear CPU buffers on exit to user x86/speculation/mds: Add mds_clear_cpu_buffers() x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests x86/speculation/mds: Add BUG_MSBDS_ONLY ...
2019-04-30KVM: x86: Add support of clear Trace_ToPA_PMI statusLuwei Kang
Let guests clear the Intel PT ToPA PMI status (bit 55 of MSR_CORE_PERF_GLOBAL_OVF_CTRL). Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30KVM: x86: Inject PMI for KVM guestLuwei Kang
Inject a PMI for KVM guest when Intel PT working in Host-Guest mode and Guest ToPA entry memory buffer was completely filled. Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16perf/x86/intel: Support adaptive PEBS v4Kan Liang
Adaptive PEBS is a new way to report PEBS sampling information. Instead of a fixed size record for all PEBS events it allows to configure the PEBS record to only include the information needed. Events can then opt in to use such an extended record, or stay with a basic record which only contains the IP. The major new feature is to support LBRs in PEBS record. Besides normal LBR, this allows (much faster) large PEBS, while still supporting callstacks through callstack LBR. So essentially a lot of profiling can now be done without frequent interrupts, dropping the overhead significantly. The main requirement still is to use a period, and not use frequency mode, because frequency mode requires reevaluating the frequency on each overflow. The floating point state (XMM) is also supported, which allows efficient profiling of FP function arguments. Introduce specific drain function to handle variable length records. Use a new callback to parse the new record format, and also handle the STATUS field now being at a different offset. Add code to set up the configuration register. Since there is only a single register, all events either get the full super set of all events, or only the basic record. Originally-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: https://lkml.kernel.org/r/20190402194509.2832-6-kan.liang@linux.intel.com [ Renamed GPRS => GP. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-03-06x86/speculation/mds: Add basic bug infrastructure for MDSAndi Kleen
Microarchitectural Data Sampling (MDS), is a class of side channel attacks on internal buffers in Intel CPUs. The variants are: - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126) - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130) - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127) MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a dependent load (store-to-load forwarding) as an optimization. The forward can also happen to a faulting or assisting load operation for a different memory address, which can be exploited under certain conditions. Store buffers are partitioned between Hyper-Threads so cross thread forwarding is not possible. But if a thread enters or exits a sleep state the store buffer is repartitioned which can expose data from one thread to the other. MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage L1 miss situations and to hold data which is returned or sent in response to a memory or I/O operation. Fill buffers can forward data to a load operation and also write data to the cache. When the fill buffer is deallocated it can retain the stale data of the preceding operations which can then be forwarded to a faulting or assisting load operation, which can be exploited under certain conditions. Fill buffers are shared between Hyper-Threads so cross thread leakage is possible. MLDPS leaks Load Port Data. Load ports are used to perform load operations from memory or I/O. The received data is then forwarded to the register file or a subsequent operation. In some implementations the Load Port can contain stale data from a previous operation which can be forwarded to faulting or assisting loads under certain conditions, which again can be exploited eventually. Load ports are shared between Hyper-Threads so cross thread leakage is possible. All variants have the same mitigation for single CPU thread case (SMT off), so the kernel can treat them as one MDS issue. Add the basic infrastructure to detect if the current CPU is affected by MDS. [ tglx: Rewrote changelog ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
2019-03-06x86/msr-index: Cleanup bit definesThomas Gleixner
Greg pointed out that speculation related bit defines are using (1 << N) format instead of BIT(N). Aside of that (1 << N) is wrong as it should use 1UL at least. Clean it up. [ Josh Poimboeuf: Fix tools build ] Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
2019-03-06x86: Add TSX Force Abort CPUID/MSRPeter Zijlstra (Intel)
Skylake systems will receive a microcode update to address a TSX errata. This microcode will (by default) clobber PMC3 when TSX instructions are (speculatively or not) executed. It also provides an MSR to cause all TSX transaction to abort and preserve PMC3. Add the CPUID enumeration and MSR definition. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-12-26Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "ARM: - selftests improvements - large PUD support for HugeTLB - single-stepping fixes - improved tracing - various timer and vGIC fixes x86: - Processor Tracing virtualization - STIBP support - some correctness fixes - refactorings and splitting of vmx.c - use the Hyper-V range TLB flush hypercall - reduce order of vcpu struct - WBNOINVD support - do not use -ftrace for __noclone functions - nested guest support for PAUSE filtering on AMD - more Hyper-V enlightenments (direct mode for synthetic timers) PPC: - nested VFIO s390: - bugfixes only this time" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits) KVM: x86: Add CPUID support for new instruction WBNOINVD kvm: selftests: ucall: fix exit mmio address guessing Revert "compiler-gcc: disable -ftracer for __noclone functions" KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup MAINTAINERS: Add arch/x86/kvm sub-directories to existing KVM/x86 entry KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range() KVM/MMU: Flush tlb directly in kvm_set_pte_rmapp() KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to kvm_mmu_notifier_change_pte() KVM: Make kvm_set_spte_hva() return int KVM: Replace old tlb flush function with new one to flush a specified range. KVM/MMU: Add tlb flush with range helper function KVM/VMX: Add hv tlb range flush support x86/hyper-v: Add HvFlushGuestAddressList hypercall support KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops KVM: x86: Disable Intel PT when VMXON in L1 guest KVM: x86: Set intercept for Intel PT MSRs read/write KVM: x86: Implement Intel PT MSRs read/write emulation ...
2018-12-21KVM: x86: Add Intel PT virtualization work modeChao Peng
Intel Processor Trace virtualization can be work in one of 2 possible modes: a. System-Wide mode (default): When the host configures Intel PT to collect trace packets of the entire system, it can leave the relevant VMX controls clear to allow VMX-specific packets to provide information across VMX transitions. KVM guest will not aware this feature in this mode and both host and KVM guest trace will output to host buffer. b. Host-Guest mode: Host can configure trace-packet generation while in VMX non-root operation for guests and root operation for native executing normally. Intel PT will be exposed to KVM guest in this mode, and the trace output to respective buffer of host and guest. In this mode, tht status of PT will be saved and disabled before VM-entry and restored after VM-exit if trace a virtual machine. Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21perf/x86/intel/pt: Add new bit definitions for PT MSRsLuwei Kang
Add bit definitions for Intel PT MSRs to support trace output directed to the memeory subsystem and holds a count if packet bytes that have been sent out. These are required by the upcoming PT support in KVM guests for MSRs read/write emulation. Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21perf/x86/intel/pt: Move Intel PT MSRs bit defines to global headerChao Peng
The Intel Processor Trace (PT) MSR bit defines are in a private header. The upcoming support for PT virtualization requires these defines to be accessible from KVM code. Move them to the global MSR header file. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-18kvm: x86: Add AMD's EX_CFG to the list of ignored MSRsEduardo Habkost
Some guests OSes (including Windows 10) write to MSR 0xc001102c on some cases (possibly while trying to apply a CPU errata). Make KVM ignore reads and writes to that MSR, so the guest won't crash. The MSR is documented as "Execution Unit Configuration (EX_CFG)", at AMD's "BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh Processors". Cc: stable@vger.kernel.org Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-28x86/speculation: Prepare for per task indirect branch speculation controlTim Chen
To avoid the overhead of STIBP always on, it's necessary to allow per task control of STIBP. Add a new task flag TIF_SPEC_IB and evaluate it during context switch if SMT is active and flag evaluation is enabled by the speculation control code. Add the conditional evaluation to x86_virt_spec_ctrl() as well so the guest/host switch works properly. This has no effect because TIF_SPEC_IB cannot be set yet and the static key which controls evaluation is off. Preparatory patch for adding the control code. [ tglx: Simplify the context switch logic and make the TIF evaluation depend on SMP=y and on the static key controlling the conditional update. Rename it to TIF_SPEC_IB because it controls both STIBP and IBPB ] Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.176917199@linutronix.de
2018-10-02perf/x86/intel: Add a separate Arch Perfmon v4 PMI handlerAndi Kleen
Implements counter freezing for Arch Perfmon v4 (Skylake and newer). This allows to speed up the PMI handler by avoiding unnecessary MSR writes and make it more accurate. The Arch Perfmon v4 PMI handler is substantially different than the older PMI handler. Differences to the old handler: - It relies on counter freezing, which eliminates several MSR writes from the PMI handler and lowers the overhead significantly. It makes the PMI handler more accurate, as all counters get frozen atomically as soon as any counter overflows. So there is much less counting of the PMI handler itself. With the freezing we don't need to disable or enable counters or PEBS. Only BTS which does not support auto-freezing still needs to be explicitly managed. - The PMU acking is done at the end, not the beginning. This makes it possible to avoid manual enabling/disabling of the PMU, instead we just rely on the freezing/acking. - The APIC is acked before reenabling the PMU, which avoids problems with LBRs occasionally not getting unfreezed on Skylake. - Looping is only needed to workaround a corner case which several PMIs are very close to each other. For common cases, the counters are freezed during PMI handler. It doesn't need to do re-check. This patch: - Adds code to enable v4 counter freezing - Fork <=v3 and >=v4 PMI handlers into separate functions. - Add kernel parameter to disable counter freezing. It took some time to debug counter freezing, so in case there are new problems we added an option to turn it off. Would not expect this to be used until there are new bugs. - Only for big core. The patch for small core will be posted later separately. Performance: When profiling a kernel build on Kabylake with different perf options, measuring the length of all NMI handlers using the nmi handler trace point: V3 is without counter freezing. V4 is with counter freezing. The value is the average cost of the PMI handler. (lower is better) perf options ` V3(ns) V4(ns) delta -c 100000 1088 894 -18% -g -c 100000 1862 1646 -12% --call-graph lbr -c 100000 3649 3367 -8% --c.g. dwarf -c 100000 2248 1982 -12% Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Link: http://lkml.kernel.org/r/1533712328-2834-2-git-send-email-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-08-05x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentryPaolo Bonzini
Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is not needed. Add a new value to enum vmx_l1d_flush_state, which is used either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-04x86/KVM/VMX: Add L1D MSR based flushPaolo Bonzini
336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR (IA32_FLUSH_CMD aka 0x10B) which has similar write-only semantics to other MSRs defined in the document. The semantics of this MSR is to allow "finer granularity invalidation of caching structures than existing mechanisms like WBINVD. It will writeback and invalidate the L1 data cache, including all cachelines brought in by preceding instructions, without invalidating all caches (eg. L2 or LLC). Some processors may also invalidate the first level level instruction cache on a L1D_FLUSH command. The L1 data and instruction caches may be shared across the logical processors of a core." Use it instead of the loop based L1 flush algorithm. A copy of this document is available at https://bugzilla.kernel.org/show_bug.cgi?id=199511 [ tglx: Avoid allocating pages when the MSR is available ] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-06-01x86: msr-index.h: Correct SNB_C1/C3_AUTO_UNDEMOTE definesMatt Turner
According to the Intel Software Developers' Manual, Vol. 4, Order No. 335592, these macros have been reversed since they were added in the initial turbostat commit. The reversed definitions were presumably copied from turbostat.c to this file. Fixes: 9c63a650bb10 ("tools/power/x86/turbostat: share kernel MSR #defines") Signed-off-by: Matt Turner <mattst88@gmail.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Len Brown <len.brown@intel.com>