aboutsummaryrefslogtreecommitdiffstats
path: root/features
AgeCommit message (Collapse)Author
2019-07-22aufs5: fix build on v5.3+Bruce Ashfield
1/1 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs5: fix build on v5.3+ Date: Mon, 22 Jul 2019 23:14:41 -0400 commit 9af0f1a46bbb6ad9ee8b35957251f4aa826b023f changes the rw_sem task owner to an atomic type. We switch to using the introduced helper function to get aufs compiling against 5.3 Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-07-22patches: refresh for 5.3Bruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-07-19features: add support for RANDOM_TRUST_CPUAnuj Mittal
Enabling this would make the kernel trust CPU's random number generator for the purposes of initializing CRNG. Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-07-19security.cfg: unset HARDENED_USERCOPY_FALLBACKAnuj Mittal
Disable fallback to gain full whitelist enforcement. Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-07-18features/intel-persistent-memory: add pmem support for intel-x86-64Yongxin Liu
Because CONFIG_DEV_DAX* are not supported in preempt-rt kernel, use two scc files for Non-RT kerel and RT kernel separately. Signed-off-by: Yongxin Liu <yongxin.liu@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-07-05feature/net: Create a config variant of net.scc to for featuresHe Zhe
Since b5776165c9d3 ("netfilter: Fix remainder of pseudo-header protocol 0"), net.scc includes a patch which would be reapplied when included by other features. This patches create a config variant net-enable.scc for features and leaves net.scc to BSPs as is. Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-07-03aufs5: kbuild supportBruce Ashfield
1/5 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs5: kbuild support Date: Wed, 3 Jul 2019 10:53:33 -0400 Application of aufs5-kbuild.patch Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 2/5 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs5: base support Date: Wed, 3 Jul 2019 10:54:31 -0400 Application of aufs5-base.patch Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 3/5 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs5: mmap support Date: Wed, 3 Jul 2019 10:55:03 -0400 Application of; aufs5-mmap.patch Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 4/5 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs5: standalone support Date: Wed, 3 Jul 2019 10:55:36 -0400 Application of: aufs5-standalone.patch Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 5/5 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs5: core support Date: Wed, 3 Jul 2019 10:56:53 -0400 Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-06-27features/security: Add more kernel hardening fragmentsHe Zhe
Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-06-27netfilter: Fix remainder of pseudo-header protocol 0Bruce Ashfield
1/1 [ Author: He Zhe Email: zhe.he@windriver.com Subject: netfilter: Fix remainder of pseudo-header protocol 0 Date: Tue, 25 Jun 2019 18:15:50 +0800 Since v5.1-rc1, some types of packets do not get unreachable reply with the following iptables setting. Fox example, $ iptables -A INPUT -p icmp --icmp-type 8 -j REJECT $ ping 127.0.0.1 -c 1 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. — 127.0.0.1 ping statistics — 1 packets transmitted, 0 received, 100% packet loss, time 0ms We should have got the following reply from command line, but we did not. From 127.0.0.1 icmp_seq=1 Destination Port Unreachable Yi Zhao reported it and narrowed it down to: 7fc38225363d ("netfilter: reject: skip csum verification for protocols that don't support it"), This is because nf_ip_checksum still expects pseudo-header protocol type 0 for packets that are of neither TCP or UDP, and thus ICMP packets are mistakenly treated as TCP/UDP. This patch corrects the conditions in nf_ip_checksum and all other places that still call it with protocol 0. Fixes: 7fc38225363d ("netfilter: reject: skip csum verification for protocols that don't support it") Reported-by: Yi Zhao <yi.zhao@windriver.com> Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-06-21scsi-debug: include core scsi support for standalone inclusionBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-06-10power: create the minimum set of cpu idle and cpufreq scaling for arm/arm64Zumeng Chen
This patch is to create the minimum set for cpu idle/freq scaling, which is ok to both arm and arm64 with DT enablement. And we leave more specific features to the end users. Signed-off-by: Zumeng.Chen <zumeng.chen@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-06-03Add SCSI debug configuration for util-linux ptestMariano López
The ptests from util-linux require the scsi debug module to be installed for a subset of tests. This patch would allow to build the kernel module for the linux-yocto kernel. Signed-off-by: Mariano López <just.another.mariano@gmail.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-05-29dev: v5.2 refresh and prepBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-05-16netfilter/netfilter.cfg: remove CONFIG_NF_NAT_IPV4Anuj Mittal
This has been removed starting v5.1 and nf_nat_ipv4,6 have been merged in nat core. https://github.com/torvalds/linux/commit/3bf195ae6037e310d693ff3313401cfaf1261b71 Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-04-26usb-net: Add RNDIS host supportTom Rini
Enable support for the USB RNDIS host driver. This is commonly seen with Android devices being used in USB tether mode. Signed-off-by: Tom Rini <trini@konsulko.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-03-21intel-x86: add Intel VMD supportLiwei Song
Enable CONFIG_VMD=y to support Intel VMD and VROC feature. Signed-off-by: Liwei Song <liwei.song@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-03-18intel-pinctrl: add pinctrl support for cannonlakePradhan Surya Narayanx
This patch adds pinctrl support for cannonlake. Signed-off-by: Pradhan Surya Narayanx <surya.narayanx.pradhan@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-03-05aufs4: kbuild patchBruce Ashfield
1/6 [ Author: Email: Subject: aufs4: kbuild patch Date: Tue, 5 Mar 2019 18:52:18 -0500 Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 2/6 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs4: base Date: Tue, 5 Mar 2019 18:53:37 -0500 upstream: https://github.com/sfjro/aufs4-standalone.git Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 3/6 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs4: mmap Date: Tue, 5 Mar 2019 18:54:13 -0500 upstream: https://github.com/sfjro/aufs4-standalone.git Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 4/6 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs4: standalone Date: Tue, 5 Mar 2019 18:54:49 -0500 upstream https://github.com/sfjro/aufs4-standalone.git Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 5/6 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs4: core support Date: Tue, 5 Mar 2019 18:56:15 -0500 upsream: https://github.com/sfjro/aufs4-standalone.git Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] 6/6 [ Author: Bruce Ashfield Email: bruce.ashfield@gmail.com Subject: aufs4: fix access_ok calls Date: Tue, 5 Mar 2019 19:23:06 -0500 Commit 96d4f267e [Remove 'type' argument from access_ok() function] changed the access_ok calls to no longer need a type parameters. We adjust the aufs calls to match. Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com> ] Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-03-05netfilter: drop removed config optionsBruce Ashfield
As per commit faec18dbb0405 [netfilter: nat: remove l4proto->manip_pkt], NF_NAT_PROTO_SCTP is no longer valid in a 5.x kernel. So we drop it from our fragment. Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-02-27features: Add the drm-bochs featureAlistair Francis
Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-02-27features: drop the obsolete kernel optionHongzhi.Song
CONFIG_MMC_BLOCK_BOUNCE has been removed by kernel. Commit id: de3ee99b097d. Signed-off-by: Hongzhi.Song <hongzhi.song@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-02-27features/hostapd: drop obsolete configsHongzhi.Song
CONFIG_WIRELESS_EXT_SYSFS has been removed by kernel. Commit id: 35b2a113cb. Signed-off-by: Hongzhi.Song <hongzhi.song@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-02-15netfilter: Enable CONFIG_NETFILTER_XT_TARGET_LOGTom Rini
In order for logging to work, as for example seen with the default configuration of 'ufw' we need to have logging support enabled. This is currently gated on the CONFIG_NETFILTER_XT_TARGET_LOG option, so enable it here. Fixes: f56608b405f0 ("meta: cleanup invalid/obselete 3.4 CONFIG options") Signed-off-by: Tom Rini <trini@konsulko.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-02-05housekeeping: remove Kconfigs which no longer exist in the kernelMark Asselstine
While working with meta-virtualization I was cleaning up build warnings: WARNING: linux-yocto-4.18.21+gitAUTOINC+9e348b6f9d_db2d813869-r0 do_kernel_configcheck: [kernel config]: This BSP sets config options that are not offered anywhere within this kernel: CONFIG_EXT3_FS_XATTR CONFIG_RESOURCE_COUNTERS CONFIG_CGROUP_MEM_RES_CTLR CONFIG_CLS_CGROUP CONFIG_NETPRIO_CGROUP CONFIG_DEVPTS_MULTIPLE_INSTANCES Most of these were coming from a configuration fragment included in meta-virtualization but some are also found here in the kernel-cache. So while I had the history handy to complete the commit for meta-virtualization I thought it best to do similar cleanup here. The follow 3 kconfigs have been removed from the kernel-cache, per the provided rationale. CONFIG_DEVPTS_MULTIPLE_INSTANCES removed since kernel v4.7 via mainline commit eedf265aa003b4781de24cfed40a655a664457e6 CONFIG_EXT3_FS_XATTR was revoved since kernel v4.3 via mainline commit c290ea01abb7907fde602f3ba55905ef10a37477 CONFIG_RESOURCE_COUNTERS gone since kernel v3.19 via mainline commit 5b1efc027c0b51ca3e76f4e00c83358f8349f543. Signed-off-by: Mark Asselstine <mark.asselstine@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
2019-01-23yaffs2: adjust to new location of MS_RDONLY defineBruce Ashfield
1/1 [ Author: OpenEmbedded Email: oe.patch@oe Subject: yaffs2: adjust to new location of MS_RDONLY define Date: Wed, 23 Jan 2019 14:50:26 +0000 Signed-off-by: OpenEmbedded <oe.patch@oe> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com> ] Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2019-01-10v5.0: kernel patch prepBruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-11-29security.cfg: rename STACKPROTECTOR configsAnuj Mittal
Rename and let kernel config determine the right option to enable as per: https://github.com/torvalds/linux/commit/2a61f4747eeaa85ce26ca9fbd81421b15facd018 Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-11-29media-usb-tv: remove CONFIG_DVB_USB_FRIIOAnuj Mittal
This has been merged with GL861 which is enabled by this feature. https://github.com/torvalds/linux/commit/b30cc07de8a903685441f9770b1b21e1422d2468 Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-11-29netfilter: remove obsolete entriesAnuj Mittal
NF_CONNTRACK_IPV4 and NF_CONNTRACK_IPV6 are no longer present starting 4.19 and instead unified under NF_CONNTRACK which is already enabled. https://github.com/torvalds/linux/commit/a0ae2562c6c4b2721d9fddba63b7286c13517d9f Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-11-12intel-x86-64: Move some configs from x86 and x86-64 shared to x86-64Hongzhi.Song
These configs are possessed privately by x86-64. CONFIG_IXGBE_DCA depends on CONFIG_DCA. CONFIG_DCA depends on x86-64. So I abstract configs related to DCA and put them to *-x86-64.cfg. Signed-off-by: Hongzhi.Song <hongzhi.song@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-11-05features/module-signing: add new featureAnuj Mittal
Add feature to enable signing of modules. If signing is to be forced, force-signing should be included, else signing.scc. Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-11-05edac: Drop CONFIG_EDAC_MM_EDAC and add dependencyHongzhi.Song
CONFIG_EDAC depends on RAS. CONFIG_EDAC_MM_EDAC was removed by kernel.[Commit: e3c4ff6d8c9] Signed-off-by: Hongzhi.Song <hongzhi.song@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-30xfs: add xfs supportDengke Du
Ceph osd daemon need to work on xfs filesystem, so add the kernel support. Signed-off-by: Dengke Du <dengke.du@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-22lxc: drop CONFIG_MM_OWNERHongzhi.Song
Kernel has killed CONFIG_MM_OWNER config fragment. [kernel commit id: f98bafa06] Signed-off-by: Hongzhi.Song <hongzhi.song@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-22vfio: drop CONFIG_KVM_DEVICE_ASSIGNMENTHongzhi.Song
Kernel has killed the config fragment for legacy device assignment deprecated since v4.2. [kernel commit id: ad6260da1e23] Signed-off-by: Hongzhi.Song <hongzhi.song@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-10media-radio.cfg: change CONFIG_RADIO_SI470X to mAnuj Mittal
This is now a tristate instead of bool and since we set V4L2 to be m, set this to be m too. https://github.com/torvalds/linux/commit/58757984ca3c73284a45dd53ac66f1414057cd09 Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-10usb-typec: enable CONFIF_TYPECAnuj Mittal
typec configs are now controlled by CONFIG_TYPEC [1]. Also enable CONFIG_TYPEC_TCPM for TYPEC_WCOVE [2]. [1] https://github.com/torvalds/linux/commit/a7c42106ead7041b99662a125b408deb68a3e6aa [2] https://github.com/torvalds/linux/commit/3c4fb9f169214290ec9a943907321e6265b36f65 Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-10iio: rename CONFIG_TSL2x7x to CONFIG_TSL2772Anuj Mittal
https://github.com/torvalds/linux/commit/4e24c1719f3485780b2be559e5fc11d091139935 Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-10iio: change CONFIG_AD5686 to CONFIG_AD5686_SPIAnuj Mittal
https://github.com/torvalds/linux/commit/0357e488b825313db3d574137337557f404e59ed Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-10media-rc: CONFIG_LIRC is now a boolAnuj Mittal
Change default configuration to 'y' and remove the now obsolete CONFIG_LIRC_CODEC. https://github.com/torvalds/linux/commit/a60d64b15c20d178ba3a9bc3a542492b4ddeea70 Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-10media-i2c: remove configs selected by zoran driversAnuj Mittal
These are selected only by zoran drivers and aren't required to be enabled explicitly. Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-10media-pci-capture: remove zoran configsAnuj Mittal
These drivers have been moved to staging and will be removed from future kernel versions. Instead of enabling staging drivers to be built, remove these instead. https://github.com/torvalds/linux/commit/68afa17322f2c9a0fffca62e7afe9d60b0dff87e Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-10-08dev: update for 4.19+Bruce Ashfield
Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-09-06intel-x86: Some required kernel config is not enabled as expectedHongzhi.Song
According to the changes before linux-4.18-rc6, we must add new dependency for some config option or modify them value. 1. If "CONFIG_PINCTRL_SUNRISEPOINT=m", CONFIG_PINCTRL_INTEL cann't be set y, because CONFIG_PINCTRL_INTEL is selected by CONFIG_PINCTRL_SUNRISEPOINT. 2. The dependency of CONFIG_MMC_REALTEK_PCI is change to CONFIG_MISC_RTSX_PCI. 3. CONFIG_INTEL_SOC_PMIC_BXTWC is a new dependency for CONFIG_INTEL_BXT_PMIC_THERMAL. 4. The value of CONFIG_HW_RANDOM_TPM must be equal to CONFIG_TCG_TPM. Signed-off-by: Hongzhi.Song <hongzhi.song@windriver.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-08-30features/thermal: use the correct config nameAnuj Mittal
CONFIG_INTEL_PMIC_THERMAL was enabled for the bxt kernel tree which had in-review patches as well. This config was re-named to CONFIG_INTEL_BXT_PMIC_THERMAL in the final merged version of patch: https://github.com/torvalds/linux/commit/b474303ffd57e0a379ce73ca10232350f866f77b Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-08-30features/crypto: drop featureAnuj Mittal
The only config enabled by this feature, CRYPTO_ZLIB, was removed starting 4.6 kernel [1]. [1] https://github.com/torvalds/linux/commit/110492183c4b8f572b16fce096b9d78e2da30baf Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-08-29features/media: drop obsolete configAnuj Mittal
Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-08-29features/soc/baytrail: fix conflict with configsAnuj Mittal
Change I2C_DESIGNWARE configs to y to prevent conflicts. It is forced to y anyway because of INTEL_SOC_PMIC which is enabled for intel-core BSP. Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-08-29features: drop obsolete configsAnuj Mittal
These are no longer present and give warnings when used with KCONF_BSP_AUDIT set. Signed-off-by: Anuj Mittal <anuj.mittal@intel.com> Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>
2018-08-22libsas: remove irq save in sas_ata_qc_issue()Bruce Ashfield
1/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: libsas: remove irq save in sas_ata_qc_issue() Date: Thu, 12 Apr 2018 09:16:22 +0200 [ upstream commit 2da11d4262639dc0e2fabc6a70886db57af25c43 ] Since commit 312d3e56119a ("[SCSI] libsas: remove ata_port.lock management duties from lldds") the sas_ata_qc_issue() function unlocks the ata_port.lock and disables interrupts before doing so. That lock is always taken with disabled interrupts so at this point, the interrupts are already disabled. There is no need to disable the interrupts before the unlock operation because they are already disabled. Restoring the interrupt state later does not change anything because they were disabled and remain disabled. Therefore remove the operations which do not change the behaviour. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 2/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: qla2xxx: remove irq save in qla2x00_poll() Date: Thu, 12 Apr 2018 09:55:25 +0200 [ upstream commit b3a8aa90c46095cbad454eb068bfb5a8eb56d4e3 ] In commit d2ba5675d899 ("[SCSI] qla2xxx: Disable local-interrupts while polling for RISC status.") added a local_irq_disable() before invoking the ->intr_handler callback. The function, which was used in this callback, did not disable interrupts while acquiring the spin_lock so a deadlock was possible and this change was one possible solution. The function in question was qla2300_intr_handler() and is using spin_lock_irqsave() since commit 43fac4d97a1a ("[SCSI] qla2xxx: Resolve a performance issue in interrupt"). I checked all other ->intr_handler callbacks and all of them use the irqsave variant so it is safe to remove the local_irq_save() block now. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 3/258 [ Author: "Steven Rostedt (VMware)" Email: rostedt@goodmis.org Subject: cgroup/tracing: Move taking of spin lock out of trace event handlers Date: Mon, 9 Jul 2018 17:48:54 -0400 [ Upstream commit e4f8d81c738db6d3ffdabfb8329aa2feaa310699 ] It is unwise to take spin locks from the handlers of trace events. Mainly, because they can introduce lockups, because it introduces locks in places that are normally not tested. Worse yet, because trace events are tucked away in the include/trace/events/ directory, locks that are taken there are forgotten about. As a general rule, I tell people never to take any locks in a trace event handler. Several cgroup trace event handlers call cgroup_path() which eventually takes the kernfs_rename_lock spinlock. This injects the spinlock in the code without people realizing it. It also can cause issues for the PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event handlers are called with preemption disabled. By moving the calculation of the cgroup_path() out of the trace event handlers and into a macro (surrounded by a trace_cgroup_##type##_enabled()), then we could place the cgroup_path into a string, and pass that to the trace event. Not only does this remove the taking of the spinlock out of the trace event handler, but it also means that the cgroup_path() only needs to be called once (it is currently called twice, once to get the length to reserver the buffer for, and once again to get the path itself. Now it only needs to be done once. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 4/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: sched/core: Remove get_cpu() from sched_fork() Date: Fri, 6 Jul 2018 15:06:15 +0200 [ Upstream commit af0fffd9300b97d8875aa745bc78e2f6fdb3c1f0 ] get_cpu() disables preemption for the entire sched_fork() function. This get_cpu() was introduced in commit: dd41f596cda0 ("sched: cfs core code") ... which also invoked sched_balance_self() and this function required preemption do be off. Today, sched_balance_self() seems to be moved to ->task_fork callback which is invoked while the ->pi_lock is held. set_load_weight() could invoke reweight_task() which then via $callchain might end up in smp_processor_id() but since `update_load' is false this won't happen. I didn't find any this_cpu*() or similar usage during the initialisation of the task_struct. The `cpu' value (from get_cpu()) is only used later in __set_task_cpu() while the ->pi_lock lock is held. Based on this it is possible to remove get_cpu() and use smp_processor_id() for the `cpu' variable without breaking anything. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180706130615.g2ex2kmfu5kcvlq6@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> ] 5/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: random: Remove preempt disabled region Date: Fri, 3 Jul 2009 08:29:30 -0500 No need to keep preemption disabled across the whole function. mix_pool_bytes() uses a spin_lock() to protect the pool and there are other places like write_pool() whhich invoke mix_pool_bytes() without disabling preemption. credit_entropy_bits() is invoked from other places like add_hwgenerator_randomness() without disabling preemption. Before commit 95b709b6be49 ("random: drop trickle mode") the function used __this_cpu_inc_return() which would require disabled preemption. The preempt_disable() section was added in commit 43d5d3018c37 ("[PATCH] random driver preempt robustness", history tree). It was claimed that the code relied on "vt_ioctl() being called under BKL". Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: enhance the commit message] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 6/258 [ Author: Anna-Maria Gleixner Email: anna-maria@linutronix.de Subject: iommu/amd: Remove redundant WARN_ON() Date: Fri, 20 Jul 2018 10:45:45 +0200 The WARN_ON() was introduced in commit 272e4f99e966 ("iommu/amd: WARN when __[attach|detach]_device are called with irqs enabled") to ensure that the domain->lock is taken in proper irqs disabled context. This is required, because the domain->lock is taken as well in irq context. The proper context check by the WARN_ON() is redundant, because it is already covered by LOCKDEP. When working with locks and changing context, a run with LOCKDEP is required anyway and would detect the wrong lock context. Furthermore all callers for those functions are within the same file and all callers acquire another lock which already disables interrupts. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 7/258 [ Author: Anna-Maria Gleixner Email: anna-maria@linutronix.de Subject: drivers/md/raid5: Use irqsave variant of atomic_dec_and_lock() Date: Fri, 4 May 2018 17:45:32 +0200 The irqsave variant of atomic_dec_and_lock handles irqsave/restore when taking/releasing the spin lock. With this variant the call of local_irq_save is no longer required. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 8/258 [ Author: Anna-Maria Gleixner Email: anna-maria@linutronix.de Subject: drivers/md/raid5: Do not disable irq on release_inactive_stripe_list() call Date: Fri, 4 May 2018 17:45:33 +0200 There is no need to invoke release_inactive_stripe_list() with interrupts disabled. All call sites, except raid5_release_stripe(), unlock ->device_lock and enable interrupts before invoking the function. Make it consistent. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 9/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: bdi: use refcount_t for reference counting instead atomic_t Date: Mon, 7 May 2018 16:51:09 +0200 refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 10/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: userns: use refcount_t for reference counting instead atomic_t Date: Mon, 7 May 2018 17:09:42 +0200 refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 11/258 [ Author: Anna-Maria Gleixner Email: anna-maria@linutronix.de Subject: bdi: Use irqsave variant of refcount_dec_and_lock() Date: Wed, 4 Apr 2018 11:43:56 +0200 The irqsave variant of refcount_dec_and_lock handles irqsave/restore when taking/releasing the spin lock. With this variant the call of local_irq_save/restore is no longer required. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> [bigeasy: s@atomic_dec_and_lock@refcount_dec_and_lock@g ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 12/258 [ Author: Anna-Maria Gleixner Email: anna-maria@linutronix.de Subject: userns: Use irqsave variant of refcount_dec_and_lock() Date: Wed, 4 Apr 2018 11:43:57 +0200 The irqsave variant of refcount_dec_and_lock handles irqsave/restore when taking/releasing the spin lock. With this variant the call of local_irq_save/restore is no longer required. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> [bigeasy: s@atomic_dec_and_lock@refcount_dec_and_lock@g ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 13/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: libata: remove ata_sff_data_xfer_noirq() Date: Thu, 19 Apr 2018 12:55:14 +0200 ata_sff_data_xfer_noirq() is invoked via the ->sff_data_xfer hook. The latter is invoked by ata_pio_sector(), atapi_send_cdb() and __atapi_pio_bytes() which in turn is invoked by ata_sff_hsm_move(). The latter function requires that the "ap->lock" lock is held which needs to be taken with disabled interrupts. There is no need have to have ata_sff_data_xfer_noirq() which invokes ata_sff_data_xfer32() with disabled interrupts because at this point the interrupts are already disabled. Remove the function and its references to it and replace all callers with ata_sff_data_xfer32(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 14/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ntfs: don't disable interrupts during kmap_atomic() Date: Tue, 10 Apr 2018 17:54:32 +0200 ntfs_end_buffer_async_read() disables interrupts around kmap_atomic(). This is a leftover from the old kmap_atomic() implementation which relied on fixed mapping slots, so the caller had to make sure that the same slot could not be reused from an interrupting context. kmap_atomic() was changed to dynamic slots long ago and commit 1ec9c5ddc17a ("include/linux/highmem.h: remove the second argument of k[un]map_atomic()") removed the slot assignements, but the callers were not checked for now redundant interrupt disabling. Remove the conditional interrupt disable. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 15/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: workingset: remove local_irq_disable() from count_shadow_nodes() Date: Fri, 22 Jun 2018 10:48:51 +0200 In commit 0c7c1bed7e13 ("mm: make counting of list_lru_one::nr_items lockless") the spin_lock(&nlru->lock); statement was replaced with rcu_read_lock(); in __list_lru_count_one(). The comment in count_shadow_nodes() says that the local_irq_disable() is required because the lock must be acquired with disabled interrupts and (spin_lock()) does not do so. Since the lock is replaced with rcu_read_lock() the local_irq_disable() is no longer needed. The code path is list_lru_shrink_count() -> list_lru_count_one() -> __list_lru_count_one() -> rcu_read_lock() -> list_lru_from_memcg_idx() -> rcu_read_unlock() Remove the local_irq_disable() statement. Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 16/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm: workingset: make shadow_lru_isolate() use locking suffix Date: Fri, 22 Jun 2018 11:43:35 +0200 shadow_lru_isolate() disables interrupts and acquires a lock. It could use spin_lock_irq() instead. It also uses local_irq_enable() while it could use spin_unlock_irq()/xa_unlock_irq(). Use proper suffix for lock/unlock in order to enable/disable interrupts during release/acquire of a lock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 17/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/list_lru: use list_lru_walk_one() in list_lru_walk_node() Date: Tue, 3 Jul 2018 12:56:19 +0200 list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the memcg_idx parameter. The same can be achieved by list_lru_walk_one() and passing NULL as memcg argument which then gets converted into -1. This is a preparation step when the spin_lock() function is lifted to the caller of __list_lru_walk_one(). Invoke list_lru_walk_one() instead __list_lru_walk_one() when possible. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 18/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/list_lru: Move locking from __list_lru_walk_one() to its caller Date: Tue, 3 Jul 2018 13:06:07 +0200 Move the locking inside __list_lru_walk_one() to its caller. This is a preparation step in order to introduce list_lru_walk_one_irq() which does spin_lock_irq() instead of spin_lock() for the locking. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 19/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/list_lru: Pass struct list_lru_node as an argument __list_lru_walk_one() Date: Tue, 3 Jul 2018 13:08:56 +0200 __list_lru_walk_one() is invoked with struct list_lru *lru, int nid as the first two argument. Those two are only used to retrieve struct list_lru_node. Since this is already done by the caller of the function for the locking, we can pass struct list_lru_node directly and avoid the dance around it. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 20/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/list_lru: Introduce list_lru_shrink_walk_irq() Date: Tue, 3 Jul 2018 13:17:27 +0200 Provide list_lru_shrink_walk_irq() and let it behave like list_lru_walk_one() except that it locks the spinlock with spin_lock_irq(). This is used by scan_shadow_nodes() because its lock nests within the i_pages lock which is acquired with IRQ. This change allows to use proper locking promitives instead hand crafted lock_irq_disable() plus spin_lock(). There is no EXPORT_SYMBOL provided because the current user is in-KERNEL only. Add list_lru_shrink_walk_irq() which acquires the spinlock with the proper locking primitives. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 21/258 [ Author: Alexandre Belloni Email: alexandre.belloni@bootlin.com Subject: ARM: at91: add TCB registers definitions Date: Wed, 18 Apr 2018 12:51:38 +0200 Add registers and bits definitions for the timer counter blocks found on Atmel ARM SoCs. Tested-by: Alexander Dahl <ada@thorsis.com> Tested-by: Andras Szemzo <szemzo.andras@gmail.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 22/258 [ Author: Alexandre Belloni Email: alexandre.belloni@bootlin.com Subject: clocksource/drivers: Add a new driver for the Atmel ARM TC blocks Date: Wed, 18 Apr 2018 12:51:39 +0200 Add a driver for the Atmel Timer Counter Blocks. This driver provides a clocksource and two clockevent devices. One of the clockevent device is linked to the clocksource counter and so it will run at the same frequency. This will be used when there is only on TCB channel available for timers. The other clockevent device runs on a separate TCB channel when available. This driver uses regmap and syscon to be able to probe early in the boot and avoid having to switch on the TCB clocksource later. Using regmap also means that unused TCB channels may be used by other drivers (PWM for example). read/writel are still used to access channel specific registers to avoid the performance impact of regmap (mainly locking). Tested-by: Alexander Dahl <ada@thorsis.com> Tested-by: Andras Szemzo <szemzo.andras@gmail.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 23/258 [ Author: Alexandre Belloni Email: alexandre.belloni@bootlin.com Subject: clocksource/drivers: atmel-pit: make option silent Date: Wed, 18 Apr 2018 12:51:40 +0200 To conform with the other option, make the ATMEL_PIT option silent so it can be selected from the platform Tested-by: Alexander Dahl <ada@thorsis.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 24/258 [ Author: Alexandre Belloni Email: alexandre.belloni@bootlin.com Subject: ARM: at91: Implement clocksource selection Date: Wed, 18 Apr 2018 12:51:41 +0200 Allow selecting and unselecting the PIT clocksource driver so it doesn't have to be compile when unused. Tested-by: Alexander Dahl <ada@thorsis.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 25/258 [ Author: Alexandre Belloni Email: alexandre.belloni@bootlin.com Subject: ARM: configs: at91: use new TCB timer driver Date: Wed, 18 Apr 2018 12:51:42 +0200 Unselecting ATMEL_TCLIB switches the TCB timer driver from tcb_clksrc to timer-atmel-tcb. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 26/258 [ Author: Alexandre Belloni Email: alexandre.belloni@bootlin.com Subject: ARM: configs: at91: unselect PIT Date: Wed, 18 Apr 2018 12:51:43 +0200 The PIT is not required anymore to successfully boot and may actually harm in case preempt-rt is used because the PIT interrupt is shared. Disable it so the TCB clocksource is used. Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 27/258 [ Author: Frank Rowand Email: frank.rowand@am.sony.com Subject: arm: Convert arm boot_lock to raw Date: Mon, 19 Sep 2011 14:51:14 -0700 The arm boot_lock is used by the secondary processor startup code. The locking task is the idle thread, which has idle->sched_class == &idle_sched_class. idle_sched_class->enqueue_task == NULL, so if the idle task blocks on the lock, the attempt to wake it when the lock becomes available will fail: try_to_wake_up() ... activate_task() enqueue_task() p->sched_class->enqueue_task(rq, p, flags) Fix by converting boot_lock to a raw spin lock. Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> Link: http://lkml.kernel.org/r/4E77B952.3010606@am.sony.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tony Lindgren <tony@atomide.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Tested-by: Krzysztof Kozlowski <krzk@kernel.org> [Exynos5422 Linaro PM-QA] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 28/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86/ioapic: Don't let setaffinity unmask threaded EOI interrupt too early Date: Tue, 17 Jul 2018 18:25:31 +0200 There is an issue with threaded interrupts which are marked ONESHOT and using the fasteoi handler. if (IS_ONESHOT()) mask_irq(); .... .... cond_unmask_eoi_irq() chip->irq_eoi(); So if setaffinity is pending then the interrupt will be moved and then unmasked, which is wrong as it should be kept masked up to the point where the threaded handler finished. It's not a real problem, the interrupt will just be able to fire before the threaded handler has finished, though the irq masked state will be wrong for a bit. The patch below should cure the issue. It also renames the horribly misnomed functions so it becomes clear what they are supposed to do. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: add the body of the patch, use the same functions in both ifdef paths (spotted by Andy Shevchenko)] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 29/258 [ Author: Yang Shi Email: yang.shi@linaro.org Subject: arm: kprobe: replace patch_lock to raw lock Date: Thu, 10 Nov 2016 16:17:55 -0800 When running kprobe on -rt kernel, the below bug is caught: BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:931 in_atomic(): 1, irqs_disabled(): 128, pid: 14, name: migration/0 INFO: lockdep is turned off. irq event stamp: 238 hardirqs last enabled at (237): [<80b5aecc>] _raw_spin_unlock_irqrestore+0x88/0x90 hardirqs last disabled at (238): [<80b56d88>] __schedule+0xec/0x94c softirqs last enabled at (0): [<80225584>] copy_process.part.5+0x30c/0x1994 softirqs last disabled at (0): [< (null)>] (null) Preemption disabled at:[<802f2b98>] cpu_stopper_thread+0xc0/0x140 CPU: 0 PID: 14 Comm: migration/0 Tainted: G O 4.8.3-rt2 #1 Hardware name: Freescale LS1021A [<80212e7c>] (unwind_backtrace) from [<8020cd2c>] (show_stack+0x20/0x24) [<8020cd2c>] (show_stack) from [<80689e14>] (dump_stack+0xa0/0xcc) [<80689e14>] (dump_stack) from [<8025a43c>] (___might_sleep+0x1b8/0x2a4) [<8025a43c>] (___might_sleep) from [<80b5b324>] (rt_spin_lock+0x34/0x74) [<80b5b324>] (rt_spin_lock) from [<80b5c31c>] (__patch_text_real+0x70/0xe8) [<80b5c31c>] (__patch_text_real) from [<80b5c3ac>] (patch_text_stop_machine+0x18/0x20) [<80b5c3ac>] (patch_text_stop_machine) from [<802f2920>] (multi_cpu_stop+0xfc/0x134) [<802f2920>] (multi_cpu_stop) from [<802f2ba0>] (cpu_stopper_thread+0xc8/0x140) [<802f2ba0>] (cpu_stopper_thread) from [<802563a4>] (smpboot_thread_fn+0x1a4/0x354) [<802563a4>] (smpboot_thread_fn) from [<80251d38>] (kthread+0x104/0x11c) [<80251d38>] (kthread) from [<80207f70>] (ret_from_fork+0x14/0x24) Since patch_text_stop_machine() is called in stop_machine() which disables IRQ, sleepable lock should be not used in this atomic context, so replace patch_lock to raw lock. Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 30/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm/unwind: use a raw_spin_lock Date: Fri, 20 Sep 2013 14:31:54 +0200 Mostly unwind is done with irqs enabled however SLUB may call it with irqs disabled while creating a new SLUB cache. I had system freeze while loading a module which called kmem_cache_create() on init. That means SLUB's __slab_alloc() disabled interrupts and then ->new_slab_objects() ->new_slab() ->setup_object() ->setup_object_debug() ->init_tracking() ->set_track() ->save_stack_trace() ->save_stack_trace_tsk() ->walk_stackframe() ->unwind_frame() ->unwind_find_idx() =>spin_lock_irqsave(&unwind_lock); Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 31/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cgroup: use irqsave in cgroup_rstat_flush_locked() Date: Tue, 3 Jul 2018 18:19:48 +0200 All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spin_lock. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires _irqsave() locking suffix in cgroup_rstat_flush_locked(). Since there is no difference between spin_lock_t and raw_spin_lock_t on !RT lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible. Acquire the raw_spin_lock_t with disabled interrupts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 32/258 [ Author: Clark Williams Email: williams@redhat.com Subject: fscache: initialize cookie hash table raw spinlocks Date: Tue, 3 Jul 2018 13:34:30 -0500 The fscache cookie mechanism uses a hash table of hlist_bl_head structures. The PREEMPT_RT patcheset adds a raw spinlock to this structure and so on PREEMPT_RT the structures get used uninitialized, causing warnings about bad magic numbers when spinlock debugging is turned on. Use the init function for fscache cookies. Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 33/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: irqchip/gic-v3-its: Make its_lock a raw_spin_lock_t Date: Fri, 13 Jul 2018 15:30:52 +0200 The its_lock lock is held while a new device is added to the list and during setup while the CPU is booted. Even on -RT the CPU-bootup is performed with disabled interrupts. Make its_lock a raw_spin_lock_t. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 34/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: irqchip/gic-v3-its: Move ITS' ->pend_page allocation into an early CPU up hook Date: Fri, 13 Jul 2018 15:45:36 +0200 The AP-GIC-starting hook allocates memory for the ->pend_page while the CPU is started during boot-up. This callback is invoked on the target CPU with disabled interrupts. This does not work on -RT beacuse memory allocations are not possible with disabled interrupts. Move the memory allocation to an earlier hotplug step which invoked with enabled interrupts on the boot CPU. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 35/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: efi: Allow efi=runtime Date: Thu, 26 Jul 2018 15:06:10 +0200 In case the option "efi=noruntime" is default at built-time, the user could overwrite its sate by `efi=runtime' and allow it again. Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 36/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86/efi: drop task_lock() from efi_switch_mm() Date: Tue, 24 Jul 2018 14:48:55 +0200 efi_switch_mm() is a wrapper around switch_mm() which saves current's ->active_mm, sets the requests mm as ->active_mm and invokes switch_mm(). I don't think that task_lock() is required during that procedure. It protects ->mm which isn't changed here. It needs to be mentioned that during the whole procedure (switch to EFI's mm and back) the preemption needs to be disabled. A context switch at this point would reset the cr3 value based on current->mm. Also, this function may not be invoked at the same time on a different CPU because it would overwrite the efi_scratch.prev_mm information. Remove task_lock() and also update the comment to reflect it. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 37/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm64: KVM: compute_layout before altenates are applied Date: Thu, 26 Jul 2018 09:13:42 +0200 compute_layout() is invoked as part of an alternative fixup under stop_machine() and needs a sleeping lock as part of get_random_long(). Invoke compute_layout() before the alternatives are applied. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 38/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: NFSv4: replace seqcount_t with a seqlock_t Date: Fri, 28 Oct 2016 23:05:11 +0200 The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me because it maps to preempt_disable() in -RT which I can't have at this point. So I took a look at the code. It the lockdep part was removed in commit abbec2da13f0 ("NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because lockdep complained. The whole seqcount thing was introduced in commit c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as being recovered"). The recovery threads runs only once. write_seqlock() does not work on !RT because it disables preemption and it the writer side is preemptible (has to remain so despite the fact that it will block readers). Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 39/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel: sched: Provide a pointer to the valid CPU mask Date: Tue, 4 Apr 2017 12:50:16 +0200 In commit 4b53a3412d66 ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper") the tsk_nr_cpus_allowed() wrapper was removed. There was not much difference in !RT but in RT we used this to implement migrate_disable(). Within a migrate_disable() section the CPU mask is restricted to single CPU while the "normal" CPU mask remains untouched. As an alternative implementation Ingo suggested to use struct task_struct { const cpumask_t *cpus_ptr; cpumask_t cpus_mask; }; with t->cpus_allowed_ptr = &t->cpus_allowed; In -RT we then can switch the cpus_ptr to t->cpus_allowed_ptr = &cpumask_of(task_cpu(p)); in a migration disabled region. The rules are simple: - Code that 'uses' ->cpus_allowed would use the pointer. - Code that 'modifies' ->cpus_allowed would use the direct mask. While converting the existing users I tried to stick with the rules above however… well mostly CPUFREQ tries to temporary switch the CPU mask to do something on a certain CPU and then switches the mask back it its original value. So in theory `cpus_ptr' could or should be used. However if this is invoked in a migration disabled region (which is not the case because it would require something like preempt_disable() and set_cpus_allowed_ptr() might sleep so it can't be) then the "restore" part would restore the wrong mask. So it only looks strange and I go for the pointer… Some drivers copy the cpumask without cpumask_copy() and others use cpumask_copy but without alloc_cpumask_var(). I did not fix those as part of this, could do this as a follow up… So is this the way we want it? Is the usage of `cpus_ptr' vs `cpus_mask' for the set + restore part (see cpufreq users) what we want? At some point it looks like they should use a different interface for their doing. I am not sure why switching to certain CPU is important but maybe it could be done via a workqueue from the CPUFREQ core (so we have a comment desribing why are doing this and a get_online_cpus() to ensure that the CPU does not go offline too early). Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 40/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/sched/core: add migrate_disable() Date: Sat, 27 May 2017 19:02:06 +0200 ] 41/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm: at91: do not disable/enable clocks in a row Date: Wed, 9 Mar 2016 10:51:06 +0100 Currently the driver will disable the clock and enable it one line later if it is switching from periodic mode into one shot. This can be avoided and causes a needless warning on -RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 42/258 [ Author: Benedikt Spranger Email: b.spranger@linutronix.de Subject: clocksource: TCLIB: Allow higher clock rates for clock events Date: Mon, 8 Mar 2010 18:57:04 +0100 As default the TCLIB uses the 32KiHz base clock rate for clock events. Add a compile time selection to allow higher clock resulution. (fixed up by Sami Pietikäinen <Sami.Pietikainen@wapice.com>) Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 43/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: timekeeping: Split jiffies seqlock Date: Thu, 14 Feb 2013 22:36:59 +0100 Replace jiffies_lock seqlock with a simple seqcounter and a rawlock so it can be taken in atomic context on RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 44/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: signal: Revert ptrace preempt magic Date: Wed, 21 Sep 2011 19:57:12 +0200 Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more than a bandaid around the ptrace design trainwreck. It's not a correctness issue, it's merily a cosmetic bandaid. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 45/258 [ Author: Marc Kleine-Budde Email: mkl@pengutronix.de Subject: net: sched: Use msleep() instead of yield() Date: Wed, 5 Mar 2014 00:49:47 +0100 On PREEMPT_RT enabled systems the interrupt handler run as threads at prio 50 (by default). If a high priority userspace process tries to shut down a busy network interface it might spin in a yield loop waiting for the device to become idle. With the interrupt thread having a lower priority than the looping process it might never be scheduled and so result in a deadlock on UP systems. With Magic SysRq the following backtrace can be produced: > test_app R running 0 174 168 0x00000000 > [<c02c7070>] (__schedule+0x220/0x3fc) from [<c02c7870>] (preempt_schedule_irq+0x48/0x80) > [<c02c7870>] (preempt_schedule_irq+0x48/0x80) from [<c0008fa8>] (svc_preempt+0x8/0x20) > [<c0008fa8>] (svc_preempt+0x8/0x20) from [<c001a984>] (local_bh_enable+0x18/0x88) > [<c001a984>] (local_bh_enable+0x18/0x88) from [<c025316c>] (dev_deactivate_many+0x220/0x264) > [<c025316c>] (dev_deactivate_many+0x220/0x264) from [<c023be04>] (__dev_close_many+0x64/0xd4) > [<c023be04>] (__dev_close_many+0x64/0xd4) from [<c023be9c>] (__dev_close+0x28/0x3c) > [<c023be9c>] (__dev_close+0x28/0x3c) from [<c023f7f0>] (__dev_change_flags+0x88/0x130) > [<c023f7f0>] (__dev_change_flags+0x88/0x130) from [<c023f904>] (dev_change_flags+0x10/0x48) > [<c023f904>] (dev_change_flags+0x10/0x48) from [<c024c140>] (do_setlink+0x370/0x7ec) > [<c024c140>] (do_setlink+0x370/0x7ec) from [<c024d2f0>] (rtnl_newlink+0x2b4/0x450) > [<c024d2f0>] (rtnl_newlink+0x2b4/0x450) from [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4) > [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4) from [<c0256740>] (netlink_rcv_skb+0xac/0xc0) > [<c0256740>] (netlink_rcv_skb+0xac/0xc0) from [<c024bbd8>] (rtnetlink_rcv+0x18/0x24) > [<c024bbd8>] (rtnetlink_rcv+0x18/0x24) from [<c02561b8>] (netlink_unicast+0x13c/0x198) > [<c02561b8>] (netlink_unicast+0x13c/0x198) from [<c025651c>] (netlink_sendmsg+0x264/0x2e0) > [<c025651c>] (netlink_sendmsg+0x264/0x2e0) from [<c022af98>] (sock_sendmsg+0x78/0x98) > [<c022af98>] (sock_sendmsg+0x78/0x98) from [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278) > [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278) from [<c022cf08>] (__sys_sendmsg+0x48/0x78) > [<c022cf08>] (__sys_sendmsg+0x48/0x78) from [<c0009320>] (ret_fast_syscall+0x0/0x2c) This patch works around the problem by replacing yield() by msleep(1), giving the interrupt thread time to finish, similar to other changes contained in the rt patch set. Using wait_for_completion() instead would probably be a better solution. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 46/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: dm rq: remove BUG_ON(!irqs_disabled) check Date: Tue, 27 Mar 2018 16:24:15 +0200 In commit 052189a2ec95 ("dm: remove superfluous irq disablement in dm_request_fn") the spin_lock_irq() was replaced with spin_lock() + a check for disabled interrupts. Later the locking part was removed in commit 2eb6e1e3aa87 ("dm: submit stacked requests in irq enabled context") but the BUG_ON() check remained. Since the original purpose for the "are-irqs-off" check is gone (the ->queue_lock has been removed) remove it. Cc: Keith Busch <keith.busch@intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 47/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: usb: do no disable interrupts in giveback Date: Fri, 8 Nov 2013 17:34:54 +0100 Since commit 94dfd7ed ("USB: HCD: support giveback of URB in tasklet context") the USB code disables interrupts before invoking the complete callback. This should not be required the HCD completes the URBs either in hard-irq context or in BH context. Lockdep may report false positives if one has two HCDs (one completes in IRQ and the other in BH context) and is using the same USB driver (device) with both HCDs. This is safe since the same URBs are never mixed with those two HCDs. Longeterm we should force all HCDs to complete in the same context. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 48/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rt: Provide PREEMPT_RT_BASE config switch Date: Fri, 17 Jun 2011 12:39:57 +0200 Introduce PREEMPT_RT_BASE which enables parts of PREEMPT_RT_FULL. Forces interrupt threading and enables some of the RT substitutions for testing. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 49/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: kconfig: Add PREEMPT_RT_FULL Date: Wed, 29 Jun 2011 14:58:57 +0200 Introduce the final symbol for PREEMPT_RT_FULL. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 50/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: cpumask: Disable CONFIG_CPUMASK_OFFSTACK for RT Date: Wed, 14 Dec 2011 01:03:49 +0100 There are "valid" GFP_ATOMIC allocations such as |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:931 |in_atomic(): 1, irqs_disabled(): 0, pid: 2130, name: tar |1 lock held by tar/2130: | #0: (&mm->mmap_sem){++++++}, at: [<ffffffff811d4e89>] SyS_brk+0x39/0x190 |Preemption disabled at:[<ffffffff81063048>] flush_tlb_mm_range+0x28/0x350 | |CPU: 1 PID: 2130 Comm: tar Tainted: G W 4.8.2-rt2+ #747 |Call Trace: | [<ffffffff814d52dc>] dump_stack+0x86/0xca | [<ffffffff810a26fb>] ___might_sleep+0x14b/0x240 | [<ffffffff819bc1d4>] rt_spin_lock+0x24/0x60 | [<ffffffff81194fba>] get_page_from_freelist+0x83a/0x11b0 | [<ffffffff81195e8b>] __alloc_pages_nodemask+0x15b/0x1190 | [<ffffffff811f0b81>] alloc_pages_current+0xa1/0x1f0 | [<ffffffff811f7df5>] new_slab+0x3e5/0x690 | [<ffffffff811fb0d5>] ___slab_alloc+0x495/0x660 | [<ffffffff811fb311>] __slab_alloc.isra.79+0x71/0xc0 | [<ffffffff811fb447>] __kmalloc_node+0xe7/0x240 | [<ffffffff814d4ee0>] alloc_cpumask_var_node+0x20/0x50 | [<ffffffff814d4f3e>] alloc_cpumask_var+0xe/0x10 | [<ffffffff810430c1>] native_send_call_func_ipi+0x21/0x130 | [<ffffffff8111c13f>] smp_call_function_many+0x22f/0x370 | [<ffffffff81062b64>] native_flush_tlb_others+0x1a4/0x3a0 | [<ffffffff8106309b>] flush_tlb_mm_range+0x7b/0x350 | [<ffffffff811c88e2>] tlb_flush_mmu_tlbonly+0x62/0xd0 | [<ffffffff811c9af4>] tlb_finish_mmu+0x14/0x50 | [<ffffffff811d1c84>] unmap_region+0xe4/0x110 | [<ffffffff811d3db3>] do_munmap+0x293/0x470 | [<ffffffff811d4f8c>] SyS_brk+0x13c/0x190 | [<ffffffff810032e2>] do_fast_syscall_32+0xb2/0x2f0 | [<ffffffff819be181>] entry_SYSENTER_compat+0x51/0x60 which forbid allocations at run-time. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 51/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: jump-label: disable if stop_machine() is used Date: Wed, 8 Jul 2015 17:14:48 +0200 Some architectures are using stop_machine() while switching the opcode which leads to latency spikes. The architectures which use stop_machine() atm: - ARM stop machine - s390 stop machine The architecures which use other sorcery: - MIPS - X86 - powerpc - sparc - arm64 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: only ARM for now] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 52/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: kconfig: Disable config options which are not RT compatible Date: Sun, 24 Jul 2011 12:11:43 +0200 Disable stuff which is known to have issues on RT Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 53/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: lockdep: disable self-test Date: Tue, 17 Oct 2017 16:36:18 +0200 The self-test wasn't always 100% accurate for RT. We disabled a few tests which failed because they had a different semantic for RT. Some still reported false positives. Now the selftest locks up the system during boot and it needs to be investigated… Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 54/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm: Allow only slub on RT Date: Fri, 3 Jul 2009 08:44:03 -0500 Disable SLAB and SLOB on -RT. Only SLUB is adopted to -RT needs. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 55/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: locking: Disable spin on owner for RT Date: Sun, 17 Jul 2011 21:51:45 +0200 Drop spin on owner for mutex / rwsem. We are most likely not using it but… Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 56/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rcu: Disable RCU_FAST_NO_HZ on RT Date: Sun, 28 Oct 2012 13:26:09 +0000 This uses a timer_list timer from the irq disabled guts of the idle code. Disable it for now to prevent wreckage. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 57/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rcu: make RCU_BOOST default on RT Date: Fri, 21 Mar 2014 20:19:05 +0100 Since it is no longer invoked from the softirq people run into OOM more often if the priority of the RCU thread is too low. Making boosting default on RT should help in those case and it can be switched off if someone knows better. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 58/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Disable CONFIG_RT_GROUP_SCHED on RT Date: Mon, 18 Jul 2011 17:03:52 +0200 Carsten reported problems when running: taskset 01 chrt -f 1 sleep 1 from within rc.local on a F15 machine. The task stays running and never gets on the run queue because some of the run queues have rt_throttled=1 which does not go away. Works nice from a ssh login shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 59/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: disable NET_RX_BUSY_POLL Date: Sat, 27 May 2017 19:02:06 +0200 sk_busy_loop() does preempt_disable() followed by a few operations which can take sleeping locks and may get long. I _think_ that we could use preempt_disable_nort() (in sk_busy_loop()) instead but after a successfull cmpxchg(&napi->state, …) we would gain the ressource and could be scheduled out. At this point nobody knows who (which context) owns it and so it could take a while until the state is realeased and napi_poll() could be invoked again. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 60/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm*: disable NEON in kernel mode Date: Fri, 1 Dec 2017 10:42:03 +0100 NEON in kernel mode is used by the crypto algorithms and raid6 code. While the raid6 code looks okay, the crypto algorithms do not: NEON is enabled on first invocation and may allocate/free/map memory before the NEON mode is disabled again. This needs to be changed until it can be enabled. On ARM NEON in kernel mode can be simply disabled. on ARM64 it needs to stay on due to possible EFI callbacks so here I disable each algorithm. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 61/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: powerpc: Use generic rwsem on RT Date: Tue, 14 Jul 2015 14:26:34 +0200 Use generic code which uses rtmutex Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 62/258 [ Author: Bogdan Purcareata Email: bogdan.purcareata@freescale.com Subject: powerpc/kvm: Disable in-kernel MPIC emulation for PREEMPT_RT_FULL Date: Fri, 24 Apr 2015 15:53:13 +0000 While converting the openpic emulation code to use a raw_spinlock_t enables guests to run on RT, there's still a performance issue. For interrupts sent in directed delivery mode with a multiple CPU mask, the emulated openpic will loop through all of the VCPUs, and for each VCPUs, it call IRQ_check, which will loop through all the pending interrupts for that VCPU. This is done while holding the raw_lock, meaning that in all this time the interrupts and preemption are disabled on the host Linux. A malicious user app can max both these number and cause a DoS. This temporary fix is sent for two reasons. First is so that users who want to use the in-kernel MPIC emulation are aware of the potential latencies, thus making sure that the hardware MPIC and their usage scenario does not involve interrupts sent in directed delivery mode, and the number of possible pending interrupts is kept small. Secondly, this should incentivize the development of a proper openpic emulation that would be better suited for RT. Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 63/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: powerpc: Disable highmem on RT Date: Mon, 18 Jul 2011 17:08:34 +0200 The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 64/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mips: Disable highmem on RT Date: Mon, 18 Jul 2011 17:10:12 +0200 The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 65/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: Use generic rwsem_spinlocks on -rt Date: Sun, 26 Jul 2009 02:21:32 +0200 Simplifies the separation of anon_rw_semaphores and rw_semaphores for -rt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 66/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: leds: trigger: disable CPU trigger on -RT Date: Thu, 23 Jan 2014 14:45:59 +0100 as it triggers: |CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 |[<c0014aa4>] (unwind_backtrace+0x0/0xf8) from [<c0012788>] (show_stack+0x1c/0x20) |[<c0012788>] (show_stack+0x1c/0x20) from [<c043c8dc>] (dump_stack+0x20/0x2c) |[<c043c8dc>] (dump_stack+0x20/0x2c) from [<c004c5e8>] (__might_sleep+0x13c/0x170) |[<c004c5e8>] (__might_sleep+0x13c/0x170) from [<c043f270>] (__rt_spin_lock+0x28/0x38) |[<c043f270>] (__rt_spin_lock+0x28/0x38) from [<c043fa00>] (rt_read_lock+0x68/0x7c) |[<c043fa00>] (rt_read_lock+0x68/0x7c) from [<c036cf74>] (led_trigger_event+0x2c/0x5c) |[<c036cf74>] (led_trigger_event+0x2c/0x5c) from [<c036e0bc>] (ledtrig_cpu+0x54/0x5c) |[<c036e0bc>] (ledtrig_cpu+0x54/0x5c) from [<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) |[<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) from [<c00590b8>] (cpu_startup_entry+0xa8/0x234) |[<c00590b8>] (cpu_startup_entry+0xa8/0x234) from [<c043b2cc>] (rest_init+0xb8/0xe0) |[<c043b2cc>] (rest_init+0xb8/0xe0) from [<c061ebe0>] (start_kernel+0x2c4/0x380) Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 67/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cpufreq: drop K8's driver from beeing selected Date: Thu, 9 Apr 2015 15:23:01 +0200 Ralf posted a picture of a backtrace from | powernowk8_target_fn() -> transition_frequency_fidvid() and then at the | end: | 932 policy = cpufreq_cpu_get(smp_processor_id()); | 933 cpufreq_cpu_put(policy); crashing the system on -RT. I assumed that policy was a NULL pointer but was rulled out. Since Ralf can't do any more investigations on this and I have no machine with this, I simply switch it off. Reported-by: Ralf Mardorf <ralf.mardorf@alice-dsl.net> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 68/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: md: disable bcache Date: Thu, 29 Aug 2013 11:48:57 +0200 It uses anon semaphores |drivers/md/bcache/request.c: In function ‘cached_dev_write_complete’: |drivers/md/bcache/request.c:1007:2: error: implicit declaration of function ‘up_read_non_owner’ [-Werror=implicit-function-declaration] | up_read_non_owner(&dc->writeback_lock); | ^ |drivers/md/bcache/request.c: In function ‘request_write’: |drivers/md/bcache/request.c:1033:2: error: implicit declaration of function ‘down_read_non_owner’ [-Werror=implicit-function-declaration] | down_read_non_owner(&dc->writeback_lock); | ^ either we get rid of those or we have to introduce them… Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 69/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: efi: Disable runtime services on RT Date: Thu, 26 Jul 2018 15:03:16 +0200 Based on meassurements the EFI functions get_variable / get_next_variable take up to 2us which looks okay. The functions get_time, set_time take around 10ms. Those 10ms are too much. Even one ms would be too much. Ard mentioned that SetVariable might even trigger larger latencies if the firware will erase flash blocks on NOR. The time-functions are used by efi-rtc and can be triggered during runtimed (either via explicit read/write or ntp sync). The variable write could be used by pstore. These functions can be disabled without much of a loss. The poweroff / reboot hooks may be provided by PSCI. Disable EFI's runtime wrappers. This was observed on "EFI v2.60 by SoftIron Overdrive 1000". Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 70/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: printk: Add a printk kill switch Date: Fri, 22 Jul 2011 17:58:40 +0200 Add a prinkt-kill-switch. This is used from (NMI) watchdog to ensure that it does not dead-lock with the early printk code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 71/258 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: printk: Add "force_early_printk" boot param to help with debugging Date: Fri, 2 Sep 2011 14:41:29 +0200 Gives me an option to screw printk and actually see what the machine says. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1314967289.1301.11.camel@twins Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-ykb97nsfmobq44xketrxs977@git.kernel.org ] 72/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: preempt: Provide preempt_*_(no)rt variants Date: Fri, 24 Jul 2009 12:38:56 +0200 RT needs a few preempt_disable/enable points which are not necessary otherwise. Implement variants to avoid #ifdeffery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 73/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: futex: workaround migrate_disable/enable in different context Date: Wed, 8 Mar 2017 14:23:35 +0100 migrate_disable()/migrate_enable() takes a different path in atomic() vs !atomic() context. These little hacks ensure that we don't underflow / overflow the migrate code counts properly while we lock the hb lockwith interrupts enabled and unlock it with interrupts disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 74/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rt: Add local irq locks Date: Mon, 20 Jun 2011 09:03:47 +0200 Introduce locallock. For !RT this maps to preempt_disable()/ local_irq_disable() so there is not much that changes. For RT this will map to a spinlock. This makes preemption possible and locked "ressource" gets the lockdep anotation it wouldn't have otherwise. The locks are recursive for owner == current. Also, all locks user migrate_disable() which ensures that the task is not migrated to another CPU while the lock is held and the owner is preempted. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 75/258 [ Author: Julia Cartwright Email: julia@ni.com Subject: locallock: provide {get,put}_locked_ptr() variants Date: Mon, 7 May 2018 08:58:56 -0500 Provide a set of locallocked accessors for pointers to per-CPU data; this is useful for dynamically-allocated per-CPU regions, for example. These are symmetric with the {get,put}_cpu_ptr() per-CPU accessor variants. Signed-off-by: Julia Cartwright <julia@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 76/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/scatterlist: Do not disable irqs on RT Date: Fri, 3 Jul 2009 08:44:34 -0500 For -RT it is enough to keep pagefault disabled (which is currently handled by kmap_atomic()). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 77/258 [ Author: Oleg Nesterov Email: oleg@redhat.com Subject: signal/x86: Delay calling signals in atomic Date: Tue, 14 Jul 2015 14:26:34 +0200 On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per CPU debug stack defined by the IST. If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return. When CONFIG_PREEMPT_RT_FULL is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal. This function calls a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issues with the corrupted stack is possible. Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored on the stacks task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled. [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT_FULL to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 78/258 [ Author: Yang Shi Email: yang.shi@linaro.org Subject: x86/signal: delay calling signals on 32bit Date: Thu, 10 Dec 2015 10:58:51 -0800 When running some ptrace single step tests on x86-32 machine, the below problem is triggered: BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917 in_atomic(): 1, irqs_disabled(): 0, pid: 1041, name: dummy2 Preemption disabled at:[<c100326f>] do_debug+0x1f/0x1a0 CPU: 10 PID: 1041 Comm: dummy2 Tainted: G W 4.1.13-rt13 #1 Call Trace: [<c1aa8306>] dump_stack+0x46/0x5c [<c1080517>] ___might_sleep+0x137/0x220 [<c1ab0eff>] rt_spin_lock+0x1f/0x80 [<c1064b5a>] do_force_sig_info+0x2a/0xc0 [<c106567d>] force_sig_info+0xd/0x10 [<c1010cff>] send_sigtrap+0x6f/0x80 [<c10033b1>] do_debug+0x161/0x1a0 [<c1ab2921>] debug_stack_correct+0x2e/0x35 This happens since 959274753857 ("x86, traps: Track entry into and exit from IST context") which was merged in v4.1-rc1. Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 79/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: buffer_head: Replace bh_uptodate_lock for -rt Date: Fri, 18 Mar 2011 09:18:52 +0100 Wrap the bit_spin_lock calls into a separate inline and add the RT replacements with a real spinlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 80/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: fs: jbd/jbd2: Make state lock and journal head lock rt safe Date: Fri, 18 Mar 2011 10:11:25 +0100 bit_spin_locks break under RT. Based on a previous patch from Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> -- include/linux/buffer_head.h | 8 ++++++++ include/linux/jbd2.h | 24 ++++++++++++++++++++++++ 2 files changed, 32 insertions(+) ] 81/258 [ Author: Paul Gortmaker Email: paul.gortmaker@windriver.com Subject: list_bl: Make list head locking RT safe Date: Fri, 21 Jun 2013 15:07:25 -0400 As per changes in include/linux/jbd_common.h for avoiding the bit_spin_locks on RT ("fs: jbd/jbd2: Make state lock and journal head lock rt safe") we do the same thing here. We use the non atomic __set_bit and __clear_bit inside the scope of the lock to preserve the ability of the existing LIST_DEBUG code to use the zero'th bit in the sanity checks. As a bit spinlock, we had no lockdep visibility into the usage of the list head locking. Now, if we were to implement it as a standard non-raw spinlock, we would see: BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 in_atomic(): 1, irqs_disabled(): 0, pid: 122, name: udevd 5 locks held by udevd/122: #0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffff811967e8>] lock_rename+0xe8/0xf0 #1: (rename_lock){+.+...}, at: [<ffffffff811a277c>] d_move+0x2c/0x60 #2: (&dentry->d_lock){+.+...}, at: [<ffffffff811a0763>] dentry_lock_for_move+0xf3/0x130 #3: (&dentry->d_lock/2){+.+...}, at: [<ffffffff811a0734>] dentry_lock_for_move+0xc4/0x130 #4: (&dentry->d_lock/3){+.+...}, at: [<ffffffff811a0747>] dentry_lock_for_move+0xd7/0x130 Pid: 122, comm: udevd Not tainted 3.4.47-rt62 #7 Call Trace: [<ffffffff810b9624>] __might_sleep+0x134/0x1f0 [<ffffffff817a24d4>] rt_spin_lock+0x24/0x60 [<ffffffff811a0c4c>] __d_shrink+0x5c/0xa0 [<ffffffff811a1b2d>] __d_drop+0x1d/0x40 [<ffffffff811a24be>] __d_move+0x8e/0x320 [<ffffffff811a278e>] d_move+0x3e/0x60 [<ffffffff81199598>] vfs_rename+0x198/0x4c0 [<ffffffff8119b093>] sys_renameat+0x213/0x240 [<ffffffff817a2de5>] ? _raw_spin_unlock+0x35/0x60 [<ffffffff8107781c>] ? do_page_fault+0x1ec/0x4b0 [<ffffffff817a32ca>] ? retint_swapgs+0xe/0x13 [<ffffffff813eb0e6>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8119b0db>] sys_rename+0x1b/0x20 [<ffffffff817a3b96>] system_call_fastpath+0x1a/0x1f Since we are only taking the lock during short lived list operations, lets assume for now that it being raw won't be a significant latency concern. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 82/258 [ Author: Josh Cartwright Email: joshc@ni.com Subject: list_bl: fixup bogus lockdep warning Date: Thu, 31 Mar 2016 00:04:25 -0500 At first glance, the use of 'static inline' seems appropriate for INIT_HLIST_BL_HEAD(). However, when a 'static inline' function invocation is inlined by gcc, all callers share any static local data declared within that inline function. This presents a problem for how lockdep classes are setup. raw_spinlocks, for example, when CONFIG_DEBUG_SPINLOCK, # define raw_spin_lock_init(lock) \ do { \ static struct lock_class_key __key; \ \ __raw_spin_lock_init((lock), #lock, &__key); \ } while (0) When this macro is expanded into a 'static inline' caller, like INIT_HLIST_BL_HEAD(): static inline INIT_HLIST_BL_HEAD(struct hlist_bl_head *h) { h->first = NULL; raw_spin_lock_init(&h->lock); } ...the static local lock_class_key object is made a function static. For compilation units which initialize invoke INIT_HLIST_BL_HEAD() more than once, then, all of the invocations share this same static local object. This can lead to some very confusing lockdep splats (example below). Solve this problem by forcing the INIT_HLIST_BL_HEAD() to be a macro, which prevents the lockdep class object sharing. ============================================= [ INFO: possible recursive locking detected ] 4.4.4-rt11 #4 Not tainted --------------------------------------------- kswapd0/59 is trying to acquire lock: (&h->lock#2){+.+.-.}, at: mb_cache_shrink_scan but task is already holding lock: (&h->lock#2){+.+.-.}, at: mb_cache_shrink_scan other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&h->lock#2); lock(&h->lock#2); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by kswapd0/59: #0: (shrinker_rwsem){+.+...}, at: rt_down_read_trylock #1: (&h->lock#2){+.+.-.}, at: mb_cache_shrink_scan Reported-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Tested-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 83/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: genirq: Disable irqpoll on -rt Date: Fri, 3 Jul 2009 08:29:57 -0500 Creates long latencies for no value Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 84/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: genirq: Force interrupt thread on RT Date: Sun, 3 Apr 2011 11:57:29 +0200 Force threaded_irqs and optimize the code (force_irqthreads) in regard to this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 85/258 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: Split IRQ-off and zone->lock while freeing pages from PCP list #2 Date: Mon, 28 May 2018 15:24:21 +0200 Split the IRQ-off section while accessing the PCP list from zone->lock while freeing pages. Introcude isolate_pcp_pages() which separates the pages from the PCP list onto a temporary list and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 86/258 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: Split IRQ-off and zone->lock while freeing pages from PCP list #2 Date: Mon, 28 May 2018 15:24:21 +0200 Split the IRQ-off section while accessing the PCP list from zone->lock while freeing pages. Introcude isolate_pcp_pages() which separates the pages from the PCP list onto a temporary list and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 87/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/SLxB: change list_lock to raw_spinlock_t Date: Mon, 28 May 2018 15:24:22 +0200 The list_lock is used with used with IRQs off on RT. Make it a raw_spinlock_t otherwise the interrupts won't be disabled on -RT. The locking rules remain the same on !RT. This patch changes it for SLAB and SLUB since both share the same header file for struct kmem_cache_node defintion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 88/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/SLUB: delay giving back empty slubs to IRQ enabled regions Date: Thu, 21 Jun 2018 17:29:19 +0200 __free_slab() is invoked with disabled interrupts which increases the irq-off time while __free_pages() is doing the work. Allow __free_slab() to be invoked with enabled interrupts and move everything from interrupts-off invocations to a temporary per-CPU list so it can be processed later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 89/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm: page_alloc: rt-friendly per-cpu pages Date: Fri, 3 Jul 2009 08:29:37 -0500 rt-friendly per-cpu pages: convert the irqs-off per-cpu locking method into a preemptible, explicit-per-cpu-locks method. Contains fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl> Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 90/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm/swap: Convert to percpu locked Date: Fri, 3 Jul 2009 08:29:51 -0500 Replace global locks (get_cpu + local_irq_save) with "local_locks()". Currently there is one of for "rotate" and one for "swap". Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 91/258 [ Author: Luiz Capitulino Email: lcapitulino@redhat.com Subject: mm: perform lru_add_drain_all() remotely Date: Fri, 27 May 2016 15:03:28 +0200 lru_add_drain_all() works by scheduling lru_add_drain_cpu() to run on all CPUs that have non-empty LRU pagevecs and then waiting for the scheduled work to complete. However, workqueue threads may never have the chance to run on a CPU that's running a SCHED_FIFO task. This causes lru_add_drain_all() to block forever. This commit solves this problem by changing lru_add_drain_all() to drain the LRU pagevecs of remote CPUs. This is done by grabbing swapvec_lock and calling lru_add_drain_cpu(). PS: This is based on an idea and initial implementation by Rik van Riel. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 92/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: mm/vmstat: Protect per cpu variables with preempt disable on RT Date: Fri, 3 Jul 2009 08:30:13 -0500 Disable preemption on -RT for the vmstat code. On vanila the code runs in IRQ-off regions while on -RT it is not. "preempt_disable" ensures that the same ressources is not updated in parallel due to preemption. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 93/258 [ Author: Frank Rowand Email: frank.rowand@am.sony.com Subject: ARM: Initialize split page table locks for vector page Date: Sat, 1 Oct 2011 18:58:13 -0700 Without this patch, ARM can not use SPLIT_PTLOCK_CPUS if PREEMPT_RT_FULL=y because vectors_user_mapping() creates a VM_ALWAYSDUMP mapping of the vector page (address 0xffff0000), but no ptl->lock has been allocated for the page. An attempt to coredump that page will result in a kernel NULL pointer dereference when follow_page() attempts to lock the page. The call tree to the NULL pointer dereference is: do_notify_resume() get_signal_to_deliver() do_coredump() elf_core_dump() get_dump_page() __get_user_pages() follow_page() pte_offset_map_lock() <----- a #define ... rt_spin_lock() The underlying problem is exposed by mm-shrink-the-page-frame-to-rt-size.patch. Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Frank <Frank_Rowand@sonyusa.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/4E87C535.2030907@am.sony.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 94/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm: Enable SLUB for RT Date: Thu, 25 Oct 2012 10:32:35 +0100 Avoid the memory allocation in IRQ section Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: factor out everything except the kcalloc() workaorund ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 95/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: slub: Enable irqs for __GFP_WAIT Date: Wed, 9 Jan 2013 12:08:15 +0100 SYSTEM_RUNNING might be too late for enabling interrupts. Allocations with GFP_WAIT can happen before that. So use this as an indicator. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 96/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: slub: Disable SLUB_CPU_PARTIAL Date: Wed, 15 Apr 2015 19:00:47 +0200 |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 |in_atomic(): 1, irqs_disabled(): 0, pid: 87, name: rcuop/7 |1 lock held by rcuop/7/87: | #0: (rcu_callback){......}, at: [<ffffffff8112c76a>] rcu_nocb_kthread+0x1ca/0x5d0 |Preemption disabled at:[<ffffffff811eebd9>] put_cpu_partial+0x29/0x220 | |CPU: 0 PID: 87 Comm: rcuop/7 Tainted: G W 4.0.0-rt0+ #477 |Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 | 000000000007a9fc ffff88013987baf8 ffffffff817441c7 0000000000000007 | 0000000000000000 ffff88013987bb18 ffffffff810eee51 0000000000000000 | ffff88013fc10200 ffff88013987bb48 ffffffff8174a1c4 000000000007a9fc |Call Trace: | [<ffffffff817441c7>] dump_stack+0x4f/0x90 | [<ffffffff810eee51>] ___might_sleep+0x121/0x1b0 | [<ffffffff8174a1c4>] rt_spin_lock+0x24/0x60 | [<ffffffff811a689a>] __free_pages_ok+0xaa/0x540 | [<ffffffff811a729d>] __free_pages+0x1d/0x30 | [<ffffffff811eddd5>] __free_slab+0xc5/0x1e0 | [<ffffffff811edf46>] free_delayed+0x56/0x70 | [<ffffffff811eecfd>] put_cpu_partial+0x14d/0x220 | [<ffffffff811efc98>] __slab_free+0x158/0x2c0 | [<ffffffff811f0021>] kmem_cache_free+0x221/0x2d0 | [<ffffffff81204d0c>] file_free_rcu+0x2c/0x40 | [<ffffffff8112c7e3>] rcu_nocb_kthread+0x243/0x5d0 | [<ffffffff810e951c>] kthread+0xfc/0x120 | [<ffffffff8174abc8>] ret_from_fork+0x58/0x90 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 97/258 [ Author: Yang Shi Email: yang.shi@windriver.com Subject: mm/memcontrol: Don't call schedule_work_on in preemption disabled context Date: Wed, 30 Oct 2013 11:48:33 -0700 The following trace is triggered when running ltp oom test cases: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 0, pid: 17188, name: oom03 Preemption disabled at:[<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0 CPU: 2 PID: 17188 Comm: oom03 Not tainted 3.10.10-rt3 #2 Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010 ffff88007684d730 ffff880070df9b58 ffffffff8169918d ffff880070df9b70 ffffffff8106db31 ffff88007688b4a0 ffff880070df9b88 ffffffff8169d9c0 ffff88007688b4a0 ffff880070df9bc8 ffffffff81059da1 0000000170df9bb0 Call Trace: [<ffffffff8169918d>] dump_stack+0x19/0x1b [<ffffffff8106db31>] __might_sleep+0xf1/0x170 [<ffffffff8169d9c0>] rt_spin_lock+0x20/0x50 [<ffffffff81059da1>] queue_work_on+0x61/0x100 [<ffffffff8112b361>] drain_all_stock+0xe1/0x1c0 [<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0 [<ffffffff8112beda>] __mem_cgroup_try_charge+0x41a/0xc40 [<ffffffff810f1c91>] ? release_pages+0x1b1/0x1f0 [<ffffffff8106f200>] ? sched_exec+0x40/0xb0 [<ffffffff8112cc87>] mem_cgroup_charge_common+0x37/0x70 [<ffffffff8112e2c6>] mem_cgroup_newpage_charge+0x26/0x30 [<ffffffff8110af68>] handle_pte_fault+0x618/0x840 [<ffffffff8103ecf6>] ? unpin_current_cpu+0x16/0x70 [<ffffffff81070f94>] ? migrate_enable+0xd4/0x200 [<ffffffff8110cde5>] handle_mm_fault+0x145/0x1e0 [<ffffffff810301e1>] __do_page_fault+0x1a1/0x4c0 [<ffffffff8169c9eb>] ? preempt_schedule_irq+0x4b/0x70 [<ffffffff8169e3b7>] ? retint_kernel+0x37/0x40 [<ffffffff8103053e>] do_page_fault+0xe/0x10 [<ffffffff8169e4c2>] page_fault+0x22/0x30 So, to prevent schedule_work_on from being called in preempt disabled context, replace the pair of get/put_cpu() to get/put_cpu_light(). Signed-off-by: Yang Shi <yang.shi@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 98/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: mm/memcontrol: Replace local_irq_disable with local locks Date: Wed, 28 Jan 2015 17:14:16 +0100 There are a few local_irq_disable() which then take sleeping locks. This patch converts them local locks. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 99/258 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: mm/zsmalloc: copy with get_cpu_var() and locking Date: Tue, 22 Mar 2016 11:16:09 +0100 get_cpu_var() disables preemption and triggers a might_sleep() splat later. This is replaced with get_locked_var(). This bitspinlocks are replaced with a proper mutex which requires a slightly larger struct to allocate. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [bigeasy: replace the bitspin_lock() with a mutex, get_locked_var(). Mike then fixed the size magic] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 100/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: radix-tree: use local locks Date: Wed, 25 Jan 2017 16:34:27 +0100 The preload functionality uses per-CPU variables and preempt-disable to ensure that it does not switch CPUs during its usage. This patch adds local_locks() instead preempt_disable() for the same purpose and to remain preemptible on -RT. Cc: stable-rt@vger.kernel.org Reported-and-debugged-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 101/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: timers: Prepare for full preemption Date: Fri, 3 Jul 2009 08:29:34 -0500 When softirqs can be preempted we need to make sure that cancelling the timer from the active thread can not deadlock vs. a running timer callback. Add a waitqueue to resolve that. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 102/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: kvm Require const tsc for RT Date: Sun, 6 Nov 2011 12:26:18 +0100 Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a preliminary for running RT in a guest on top of a RT host. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 103/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: wait.h: include atomic.h Date: Mon, 28 Oct 2013 12:19:57 +0100 | CC init/main.o |In file included from include/linux/mmzone.h:9:0, | from include/linux/gfp.h:4, | from include/linux/kmod.h:22, | from include/linux/module.h:13, | from init/main.c:15: |include/linux/wait.h: In function ‘wait_on_atomic_t’: |include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration] | if (atomic_read(val) == 0) | ^ This pops up on ARM. Non-RT gets its atomic.h include from spinlock.h Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 104/258 [ Author: Daniel Wagner Email: daniel.wagner@bmw-carit.de Subject: work-simple: Simple work queue implemenation Date: Fri, 11 Jul 2014 15:26:11 +0200 Provides a framework for enqueuing callbacks from irq context PREEMPT_RT_FULL safe. The callbacks are executed in kthread context. Bases on wait-simple. Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> ] 105/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: completion: Use simple wait queues Date: Fri, 11 Jan 2013 11:23:51 +0100 Completions have no long lasting callbacks and therefor do not need the complex waitqueue variant. Use simple waitqueues which reduces the contention on the waitqueue lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 106/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/aio: simple simple work Date: Mon, 16 Feb 2015 18:49:10 +0100 |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:768 |in_atomic(): 1, irqs_disabled(): 0, pid: 26, name: rcuos/2 |2 locks held by rcuos/2/26: | #0: (rcu_callback){.+.+..}, at: [<ffffffff810b1a12>] rcu_nocb_kthread+0x1e2/0x380 | #1: (rcu_read_lock_sched){.+.+..}, at: [<ffffffff812acd26>] percpu_ref_kill_rcu+0xa6/0x1c0 |Preemption disabled at:[<ffffffff810b1a93>] rcu_nocb_kthread+0x263/0x380 |Call Trace: | [<ffffffff81582e9e>] dump_stack+0x4e/0x9c | [<ffffffff81077aeb>] __might_sleep+0xfb/0x170 | [<ffffffff81589304>] rt_spin_lock+0x24/0x70 | [<ffffffff811c5790>] free_ioctx_users+0x30/0x130 | [<ffffffff812ace34>] percpu_ref_kill_rcu+0x1b4/0x1c0 | [<ffffffff810b1a93>] rcu_nocb_kthread+0x263/0x380 | [<ffffffff8106e046>] kthread+0xd6/0xf0 | [<ffffffff81591eec>] ret_from_fork+0x7c/0xb0 replace this preempt_disable() friendly swork. Reported-By: Mike Galbraith <umgwanakikbuti@gmail.com> Suggested-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 107/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: genirq: Do not invoke the affinity callback via a workqueue on RT Date: Wed, 21 Aug 2013 17:48:46 +0200 Joe Korty reported, that __irq_set_affinity_locked() schedules a workqueue while holding a rawlock which results in a might_sleep() warning. This patch uses swork_queue() instead. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 108/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: time/hrtimer: avoid schedule_work() with interrupts disabled Date: Wed, 15 Nov 2017 17:29:51 +0100 The NOHZ code tries to schedule a workqueue with interrupts disabled. Since this does not work -RT I am switching it to swork instead. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 109/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: hrtimer: consolidate hrtimer_init() + hrtimer_init_sleeper() calls Date: Mon, 4 Sep 2017 18:31:50 +0200 hrtimer_init_sleeper() calls require a prior initialisation of the hrtimer object with hrtimer_init(). Lets make the initialisation of the hrtimer object part of hrtimer_init_sleeper(). To remain consistent consider init_on_stack as well. Beside adapting the hrtimer_init_sleeper[_on_stack]() functions, call sites need to be updated as well. [anna-maria: Updating the commit message] Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 110/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: hrtimers: Prepare full preemption Date: Fri, 3 Jul 2009 08:29:34 -0500 Make cancellation of a running callback in softirq context safe against preemption. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 111/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: hrtimer: by timers by default into the softirq context Date: Fri, 3 Jul 2009 08:44:31 -0500 We can't have hrtimers callbacks running in hardirq context on RT. Therefore the timers are deferred to the softirq context by default. There are few timers which expect to be run in hardirq context even on RT. Those are: - very short running where low latency is critical (kvm lapic) - timers which take raw locks and need run in hard-irq context (perf, sched) - wake up related timer (kernel side of clock_nanosleep() and so on) Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 112/258 [ Author: Yang Shi Email: yang.shi@windriver.com Subject: hrtimer: Move schedule_work call to helper thread Date: Mon, 16 Sep 2013 14:09:19 -0700 When run ltp leapsec_timer test, the following call trace is caught: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1 Preemption disabled at:[<ffffffff810857f3>] cpu_startup_entry+0x133/0x310 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.10-rt3 #2 Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010 ffffffff81c2f800 ffff880076843e40 ffffffff8169918d ffff880076843e58 ffffffff8106db31 ffff88007684b4a0 ffff880076843e70 ffffffff8169d9c0 ffff88007684b4a0 ffff880076843eb0 ffffffff81059da1 0000001876851200 Call Trace: <IRQ> [<ffffffff8169918d>] dump_stack+0x19/0x1b [<ffffffff8106db31>] __might_sleep+0xf1/0x170 [<ffffffff8169d9c0>] rt_spin_lock+0x20/0x50 [<ffffffff81059da1>] queue_work_on+0x61/0x100 [<ffffffff81065aa1>] clock_was_set_delayed+0x21/0x30 [<ffffffff810883be>] do_timer+0x40e/0x660 [<ffffffff8108f487>] tick_do_update_jiffies64+0xf7/0x140 [<ffffffff8108fe42>] tick_check_idle+0x92/0xc0 [<ffffffff81044327>] irq_enter+0x57/0x70 [<ffffffff816a040e>] smp_apic_timer_interrupt+0x3e/0x9b [<ffffffff8169f80a>] apic_timer_interrupt+0x6a/0x70 <EOI> [<ffffffff8155ea1c>] ? cpuidle_enter_state+0x4c/0xc0 [<ffffffff8155eb68>] cpuidle_idle_call+0xd8/0x2d0 [<ffffffff8100b59e>] arch_cpu_idle+0xe/0x30 [<ffffffff8108585e>] cpu_startup_entry+0x19e/0x310 [<ffffffff8168efa2>] start_secondary+0x1ad/0x1b0 The clock_was_set_delayed is called in hard IRQ handler (timer interrupt), which calls schedule_work. Under PREEMPT_RT_FULL, schedule_work calls spinlocks which could sleep, so it's not safe to call schedule_work in interrupt context. Reference upstream commit b68d61c705ef02384c0538b8d9374545097899ca (rt,ntp: Move call to schedule_delayed_work() to helper thread) from git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git, which makes a similar change. Signed-off-by: Yang Shi <yang.shi@windriver.com> [bigeasy: use swork_queue() instead a helper thread] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 113/258 [ Author: John Stultz Email: johnstul@us.ibm.com Subject: posix-timers: Thread posix-cpu-timers on -rt Date: Fri, 3 Jul 2009 08:29:58 -0500 posix-cpu-timer code takes non -rt safe locks in hard irq context. Move it to a thread. [ 3.0 fixes from Peter Zijlstra <peterz@infradead.org> ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 114/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Move task_struct cleanup to RCU Date: Tue, 31 May 2011 16:59:16 +0200 __put_task_struct() does quite some expensive work. We don't want to burden random tasks with that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 115/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Limit the number of task migrations per batch Date: Mon, 6 Jun 2011 12:12:51 +0200 Put an upper limit on the number of tasks which are migrated per batch to avoid large latencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 116/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Move mmdrop to RCU on RT Date: Mon, 6 Jun 2011 12:20:33 +0200 Takes sleeping locks and calls into the memory allocator, so nothing we want to do in task switch and oder atomic contexts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 117/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/sched: move stack + kprobe clean up to __put_task_struct() Date: Mon, 21 Nov 2016 19:31:08 +0100 There is no need to free the stack before the task struct (except for reasons mentioned in commit 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")). This also comes handy on -RT because we can't free memory in preempt disabled region. Cc: stable-rt@vger.kernel.org #for kprobe_flush_task() Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 118/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Add saved_state for tasks blocked on sleeping locks Date: Sat, 25 Jun 2011 09:21:04 +0200 Spinlocks are state preserving in !RT. RT changes the state when a task gets blocked on a lock. So we need to remember the state before the lock contention. If a regular wakeup (not a RTmutex related wakeup) happens, the saved_state is updated to running. When the lock sleep is done, the saved state is restored. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 119/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Do not account rcu_preempt_depth on RT in might_sleep() Date: Tue, 7 Jun 2011 09:19:06 +0200 RT changes the rcu_preempt_depth semantics, so we cannot check for it in might_sleep(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 120/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Use the proper LOCK_OFFSET for cond_resched() Date: Sun, 17 Jul 2011 22:51:33 +0200 RT does not increment preempt count when a 'sleeping' spinlock is locked. Update PREEMPT_LOCK_OFFSET for that case. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 121/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Disable TTWU_QUEUE on RT Date: Tue, 13 Sep 2011 16:42:35 +0200 The queued remote wakeup mechanism can introduce rather large latencies if the number of migrated tasks is high. Disable it for RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 122/258 [ Author: Steven Rostedt Email: rostedt@goodmis.org Subject: sched/workqueue: Only wake up idle workers if not blocked on sleeping spin lock Date: Mon, 18 Mar 2013 15:12:49 -0400 In -rt, most spin_locks() turn into mutexes. One of these spin_lock conversions is performed on the workqueue gcwq->lock. When the idle worker is worken, the first thing it will do is grab that same lock and it too will block, possibly jumping into the same code, but because nr_running would already be decremented it prevents an infinite loop. But this is still a waste of CPU cycles, and it doesn't follow the method of mainline, as new workers should only be woken when a worker thread is truly going to sleep, and not just blocked on a spin_lock(). Check the saved_state too before waking up new workers. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 123/258 [ Author: Daniel Bristot de Oliveira Email: bristot@redhat.com Subject: rt: Increase/decrease the nr of migratory tasks when enabling/disabling migration Date: Mon, 26 Jun 2017 17:07:15 +0200 There is a problem in the migrate_disable()/enable() implementation regarding the number of migratory tasks in the rt/dl RQs. The problem is the following: When a task is attached to the rt runqueue, it is checked if it either can run in more than one CPU, or if it is with migration disable. If either check is true, the rt_rq->rt_nr_migratory counter is not increased. The counter increases otherwise. When the task is detached, the same check is done. If either check is true, the rt_rq->rt_nr_migratory counter is not decreased. The counter decreases otherwise. The same check is done in the dl scheduler. One important thing is that, migrate disable/enable does not touch this counter for tasks attached to the rt rq. So suppose the following chain of events. Assumptions: Task A is the only runnable task in A Task B runs on the CPU B Task A runs on CFS (non-rt) Task B has RT priority Thus, rt_nr_migratory is 0 B is running Task A can run on all CPUS. Timeline: CPU A/TASK A CPU B/TASK B A takes the rt mutex X . A disables migration . . B tries to take the rt mutex X . As it is held by A { . A inherits the rt priority of B . A is dequeued from CFS RQ of CPU A . A is enqueued in the RT RQ of CPU A . As migration is disabled . rt_nr_migratory in A is not increased . A enables migration A releases the rt mutex X { A returns to its original priority A ask to be dequeued from RT RQ { As migration is now enabled and it can run on all CPUS { rt_nr_migratory should be decreased As rt_nr_migratory is 0, rt_nr_migratory under flows } } This variable is important because it notifies if there are more than one runnable & migratory task in the runqueue. If there are more than one tasks, the rt_rq is set as overloaded, and then tries to migrate some tasks. This rule is important to keep the scheduler working conserving, that is, in a system with M CPUs, the M highest priority tasks should be running. As rt_nr_migratory is unsigned, it will become > 0, notifying that the RQ is overloaded, activating pushing mechanism without need. This patch fixes this problem by decreasing/increasing the rt/dl_nr_migratory in the migrate disable/enable operations. Reported-by: Pei Zhang <pezhang@redhat.com> Reported-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: LKML <linux-kernel@vger.kernel.org> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 124/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: hotplug: Lightweight get online cpus Date: Wed, 15 Jun 2011 12:36:06 +0200 get_online_cpus() is a heavy weight function which involves a global mutex. migrate_disable() wants a simpler construct which prevents only a CPU from going doing while a task is in a migrate disabled section. Implement a per cpu lockless mechanism, which serializes only in the real unplug case on a global mutex. That serialization affects only tasks on the cpu which should be brought down. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 125/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: trace: Add migrate-disabled counter to tracing output Date: Sun, 17 Jul 2011 21:56:42 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 126/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: lockdep: Make it RT aware Date: Sun, 17 Jul 2011 18:51:23 +0200 teach lockdep that we don't really do softirqs on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 127/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: tasklet: Prevent tasklets from going into infinite spin in RT Date: Tue, 29 Nov 2011 20:18:22 -0500 When CONFIG_PREEMPT_RT_FULL is enabled, tasklets run as threads, and spinlocks turn are mutexes. But this can cause issues with tasks disabling tasklets. A tasklet runs under ksoftirqd, and if a tasklets are disabled with tasklet_disable(), the tasklet count is increased. When a tasklet runs, it checks this counter and if it is set, it adds itself back on the softirq queue and returns. The problem arises in RT because ksoftirq will see that a softirq is ready to run (the tasklet softirq just re-armed itself), and will not sleep, but instead run the softirqs again. The tasklet softirq will still see that the count is non-zero and will not execute the tasklet and requeue itself on the softirq again, which will cause ksoftirqd to run it again and again and again. It gets worse because ksoftirqd runs as a real-time thread. If it preempted the task that disabled tasklets, and that task has migration disabled, or can't run for other reasons, the tasklet softirq will never run because the count will never be zero, and ksoftirqd will go into an infinite loop. As an RT task, it this becomes a big problem. This is a hack solution to have tasklet_disable stop tasklets, and when a tasklet runs, instead of requeueing the tasklet softirqd it delays it. When tasklet_enable() is called, and tasklets are waiting, then the tasklet_enable() will kick the tasklets to continue. This prevents the lock up from ksoftirq going into an infinite loop. [ rostedt@goodmis.org: ported to 3.0-rt ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 128/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Check preemption after reenabling interrupts Date: Sun, 13 Nov 2011 17:17:09 +0100 raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the rasie_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 129/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Disable softirq stacks for RT Date: Mon, 18 Jul 2011 13:59:17 +0200 Disable extra stacks for softirqs. We want to preempt softirqs and having them on special IRQ-stack does not make this easier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 130/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: softirq: Split softirq locks Date: Thu, 4 Oct 2012 14:20:47 +0100 The 3.x RT series removed the split softirq implementation in favour of pushing softirq processing into the context of the thread which raised it. Though this prevents us from handling the various softirqs at different priorities. Now instead of reintroducing the split softirq threads we split the locks which serialize the softirq processing. If a softirq is raised in context of a thread, then the softirq is noted on a per thread field, if the thread is in a bh disabled region. If the softirq is raised from hard interrupt context, then the bit is set in the flag field of ksoftirqd and ksoftirqd is invoked. When a thread leaves a bh disabled region, then it tries to execute the softirqs which have been raised in its own context. It acquires the per softirq / per cpu lock for the softirq and then checks, whether the softirq is still pending in the per cpu local_softirq_pending() field. If yes, it runs the softirq. If no, then some other task executed it already. This allows for zero config softirq elevation in the context of user space tasks or interrupt threads. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 131/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: use local_bh_disable() in netif_rx_ni() Date: Fri, 16 Jun 2017 19:03:16 +0200 In 2004 netif_rx_ni() gained a preempt_disable() section around netif_rx() and its do_softirq() + testing for it. The do_softirq() part is required because netif_rx() raises the softirq but does not invoke it. The preempt_disable() is required to remain on the same CPU which added the skb to the per-CPU list. All this can be avoided be putting this into a local_bh_disable()ed section. The local_bh_enable() part will invoke do_softirq() if required. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 132/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: genirq: Allow disabling of softirq processing in irq thread context Date: Tue, 31 Jan 2012 13:01:27 +0100 The processing of softirqs in irq thread context is a performance gain for the non-rt workloads of a system, but it's counterproductive for interrupts which are explicitely related to the realtime workload. Allow such interrupts to prevent softirq processing in their thread context. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 133/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: softirq: split timer softirqs out of ksoftirqd Date: Wed, 20 Jan 2016 16:34:17 +0100 The softirqd runs in -RT with SCHED_FIFO (prio 1) and deals mostly with timer wakeup which can not happen in hardirq context. The prio has been risen from the normal SCHED_OTHER so the timer wakeup does not happen too late. With enough networking load it is possible that the system never goes idle and schedules ksoftirqd and everything else with a higher priority. One of the tasks left behind is one of RCU's threads and so we see stalls and eventually run out of memory. This patch moves the TIMER and HRTIMER softirqs out of the `ksoftirqd` thread into its own `ktimersoftd`. The former can now run SCHED_OTHER (same as mainline) and the latter at SCHED_FIFO due to the wakeups. From networking point of view: The NAPI callback runs after the network interrupt thread completes. If its run time takes too long the NAPI code itself schedules the `ksoftirqd`. Here in the thread it can run at SCHED_OTHER priority and it won't defer RCU anymore. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 134/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rtmutex: trylock is okay on -RT Date: Wed, 2 Dec 2015 11:34:07 +0100 non-RT kernel could deadlock on rt_mutex_trylock() in softirq context. On -RT we don't run softirqs in IRQ context but in thread context so it is not a issue here. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 135/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/nfs: turn rmdir_sem into a semaphore Date: Thu, 15 Sep 2016 10:51:27 +0200 The RW semaphore had a reader side which used the _non_owner version because it most likely took the reader lock in one thread and released it in another which would cause lockdep to complain if the "regular" version was used. On -RT we need the owner because the rw lock is turned into a rtmutex. The semaphores on the hand are "plain simple" and should work as expected. We can't have multiple readers but on -RT we don't allow multiple readers anyway so that is not a loss. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 136/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Handle the various new futex race conditions Date: Fri, 10 Jun 2011 11:04:15 +0200 RT opens a few new interesting race conditions in the rtmutex/futex combo due to futex hash bucket lock being a 'sleeping' spinlock and therefor not disabling preemption. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 137/258 [ Author: Steven Rostedt Email: rostedt@goodmis.org Subject: futex: Fix bug on when a requeued RT task times out Date: Tue, 14 Jul 2015 14:26:34 +0200 Requeue with timeout causes a bug with PREEMPT_RT_FULL. The bug comes from a timed out condition. TASK 1 TASK 2 ------ ------ futex_wait_requeue_pi() futex_wait_queue_me() <timed out> double_lock_hb(); raw_spin_lock(pi_lock); if (current->pi_blocked_on) { } else { current->pi_blocked_on = PI_WAKE_INPROGRESS; run_spin_unlock(pi_lock); spin_lock(hb->lock); <-- blocked! plist_for_each_entry_safe(this) { rt_mutex_start_proxy_lock(); task_blocks_on_rt_mutex(); BUG_ON(task->pi_blocked_on)!!!! The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to grab the hb->lock, which it fails to do so. As the hb->lock is a mutex, it will block and set the "pi_blocked_on" to the hb->lock. When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGESS fails because the task1's pi_blocked_on is no longer set to that, but instead, set to the hb->lock. The fix: When calling rt_mutex_start_proxy_lock() a check is made to see if the proxy tasks pi_blocked_on is set. If so, exit out early. Otherwise set it to a new flag PI_REQUEUE_INPROGRESS, which notifies the proxy task that it is being requeued, and will handle things appropriately. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 138/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: futex: Ensure lock/unlock symetry versus pi_lock and hash bucket lock Date: Fri, 1 Mar 2013 11:17:42 +0100 In exit_pi_state_list() we have the following locking construct: spin_lock(&hb->lock); raw_spin_lock_irq(&curr->pi_lock); ... spin_unlock(&hb->lock); In !RT this works, but on RT the migrate_enable() function which is called from spin_unlock() sees atomic context due to the held pi_lock and just decrements the migrate_disable_atomic counter of the task. Now the next call to migrate_disable() sees the counter being negative and issues a warning. That check should be in migrate_enable() already. Fix this by dropping pi_lock before unlocking hb->lock and reaquire pi_lock after that again. This is safe as the loop code reevaluates head again under the pi_lock. Reported-by: Yong Zhang <yong.zhang@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 139/258 [ Author: Grygorii Strashko Email: Grygorii.Strashko@linaro.org Subject: pid.h: include atomic.h Date: Tue, 21 Jul 2015 19:43:56 +0300 This patch fixes build error: CC kernel/pid_namespace.o In file included from kernel/pid_namespace.c:11:0: include/linux/pid.h: In function 'get_pid': include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration] atomic_inc(&pid->count); ^ which happens when CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_SPINLOCK=n CONFIG_DEBUG_MUTEXES=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PID_NS=y Vanilla gets this via spinlock.h. Signed-off-by: Grygorii Strashko <Grygorii.Strashko@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 140/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm: include definition for cpumask_t Date: Thu, 22 Dec 2016 17:28:33 +0100 This definition gets pulled in by other files. With the (later) split of RCU and spinlock.h it won't compile anymore. The split is done in ("rbtree: don't include the rcu header"). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 141/258 [ Author: "Wolfgang M. Reimer" Email: linuxball@gmail.com Subject: locking: locktorture: Do NOT include rwlock.h directly Date: Tue, 21 Jul 2015 16:20:07 +0200 Including rwlock.h directly will cause kernel builds to fail if CONFIG_PREEMPT_RT_FULL is defined. The correct header file (rwlock_rt.h OR rwlock.h) will be included by spinlock.h which is included by locktorture.c anyway. Cc: stable-rt@vger.kernel.org Signed-off-by: Wolfgang M. Reimer <linuxball@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 142/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Add rtmutex_lock_killable() Date: Thu, 9 Jun 2011 11:43:52 +0200 Add "killable" type to rtmutex. We need this since rtmutex are used as "normal" mutexes which do use this type. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 143/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Make lock_killable work Date: Sat, 1 Apr 2017 12:50:59 +0200 Locking an rt mutex killable does not work because signal handling is restricted to TASK_INTERRUPTIBLE. Use signal_pending_state() unconditionaly. Cc: stable-rt@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 144/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: spinlock: Split the lock types header Date: Wed, 29 Jun 2011 19:34:01 +0200 Split raw_spinlock into its own file and the remaining spinlock_t into its own non-RT header. The non-RT header will be replaced later by sleeping spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 145/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Avoid include hell Date: Wed, 29 Jun 2011 20:06:39 +0200 Include only the required raw types. This avoids pulling in the complete spinlock header which in turn requires rtmutex.h at some point. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 146/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rbtree: don't include the rcu header Date: Wed, 22 Aug 2018 09:24:48 -0400 The RCU header pulls in spinlock.h and fails due not yet defined types: |In file included from include/linux/spinlock.h:275:0, | from include/linux/rcupdate.h:38, | from include/linux/rbtree.h:34, | from include/linux/rtmutex.h:17, | from include/linux/spinlock_types.h:18, | from kernel/bounds.c:13: |include/linux/rwlock_rt.h:16:38: error: unknown type name ‘rwlock_t’ | extern void __lockfunc rt_write_lock(rwlock_t *rwlock); | ^ This patch moves the required RCU function from the rcupdate.h header file into a new header file which can be included by both users. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 147/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: Provide rt_mutex_slowlock_locked() Date: Thu, 12 Oct 2017 16:14:22 +0200 This is the inner-part of rt_mutex_slowlock(), required for rwsem-rt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 148/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: export lockdep-less version of rt_mutex's lock, trylock and unlock Date: Thu, 12 Oct 2017 16:36:39 +0200 Required for lock implementation ontop of rtmutex. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 149/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: add sleeping lock implementation Date: Thu, 12 Oct 2017 17:11:19 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 150/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: add mutex implementation based on rtmutex Date: Thu, 12 Oct 2017 17:17:03 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 151/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: add rwsem implementation based on rtmutex Date: Thu, 12 Oct 2017 17:28:34 +0200 The RT specific R/W semaphore implementation restricts the number of readers to one because a writer cannot block on multiple readers and inherit its priority or budget. The single reader restricting is painful in various ways: - Performance bottleneck for multi-threaded applications in the page fault path (mmap sem) - Progress blocker for drivers which are carefully crafted to avoid the potential reader/writer deadlock in mainline. The analysis of the writer code pathes shows, that properly written RT tasks should not take them. Syscalls like mmap(), file access which take mmap sem write locked have unbound latencies which are completely unrelated to mmap sem. Other R/W sem users like graphics drivers are not suitable for RT tasks either. So there is little risk to hurt RT tasks when the RT rwsem implementation is changed in the following way: - Allow concurrent readers - Make writers block until the last reader left the critical section. This blocking is not subject to priority/budget inheritance. - Readers blocked on a writer inherit their priority/budget in the normal way. There is a drawback with this scheme. R/W semaphores become writer unfair though the applications which have triggered writer starvation (mostly on mmap_sem) in the past are not really the typical workloads running on a RT system. So while it's unlikely to hit writer starvation, it's possible. If there are unexpected workloads on RT systems triggering it, we need to rethink the approach. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 152/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: add rwlock implementation based on rtmutex Date: Thu, 12 Oct 2017 17:18:06 +0200 The implementation is bias-based, similar to the rwsem implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 153/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rtmutex: wire up RT's locking Date: Thu, 12 Oct 2017 17:31:14 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 154/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rtmutex: add ww_mutex addon for mutex-rt Date: Thu, 12 Oct 2017 17:34:38 +0200 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 155/258 [ Author: Mikulas Patocka Email: mpatocka@redhat.com Subject: locking/rt-mutex: fix deadlock in device mapper / block-IO Date: Mon, 13 Nov 2017 12:56:53 -0500 When some block device driver creates a bio and submits it to another block device driver, the bio is added to current->bio_list (in order to avoid unbounded recursion). However, this queuing of bios can cause deadlocks, in order to avoid them, device mapper registers a function flush_current_bio_list. This function is called when device mapper driver blocks. It redirects bios queued on current->bio_list to helper workqueues, so that these bios can proceed even if the driver is blocked. The problem with CONFIG_PREEMPT_RT_FULL is that when the device mapper driver blocks, it won't call flush_current_bio_list (because tsk_is_pi_blocked returns true in sched_submit_work), so deadlocks in block device stack can happen. Note that we can't call blk_schedule_flush_plug if tsk_is_pi_blocked returns true - that would cause BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in task_blocks_on_rt_mutex when flush_current_bio_list attempts to take a spinlock. So the proper fix is to call blk_schedule_flush_plug in rt_mutex_fastlock, when fast acquire failed and when the task is about to block. CC: stable-rt@vger.kernel.org [bigeasy: The deadlock is not device-mapper specific, it can also occur in plain EXT4] Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 156/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking/rtmutex: re-init the wait_lock in rt_mutex_init_proxy_locked() Date: Thu, 16 Nov 2017 16:48:48 +0100 We could provide a key-class for the lockdep (and fixup all callers) or move the init to all callers (like it was) in order to avoid lockdep seeing a double-lock of the wait_lock. Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 157/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: ptrace: fix ptrace vs tasklist_lock race Date: Thu, 29 Aug 2013 18:21:04 +0200 As explained by Alexander Fyodorov <halcy@yandex.ru>: |read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel, |and it can remove __TASK_TRACED from task->state (by moving it to |task->saved_state). If parent does wait() on child followed by a sys_ptrace |call, the following race can happen: | |- child sets __TASK_TRACED in ptrace_stop() |- parent does wait() which eventually calls wait_task_stopped() and returns | child's pid |- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves | __TASK_TRACED flag to saved_state |- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive() The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 158/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: rtmutex: annotate sleeping lock context Date: Thu, 21 Sep 2017 14:25:13 +0200 The RCU code complains on schedule() within a rcu_readlock() section. The valid scenario on -RT is if a sleeping is held. In order to suppress the warning the mirgrate_disable counter was used to identify the invocation of schedule() due to lock contention. Grygorii Strashko report that during CPU hotplug we might see the warning via rt_spin_lock() -> migrate_disable() -> pin_current_cpu() -> __read_rt_lock() because the counter is not yet set. It is also possible to trigger the warning from cpu_chill() (seen on a kblockd_mod_delayed_work_on() caller). To address this RCU warning I annotate the sleeping lock context. The counter is incremented before migrate_disable() so the warning Grygorii should not trigger anymore. Additionally I use that counter in cpu_chill() to avoid the RCU warning from there. Reported-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 159/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: sched/migrate_disable: fallback to preempt_disable() instead barrier() Date: Thu, 5 Jul 2018 14:44:51 +0200 On SMP + !RT migrate_disable() is still around. It is not part of spin_lock() anymore so it has almost no users. However the futex code has a workaround for the !in_atomic() part of migrate disable which fails because the matching migrade_disable() is no longer part of spin_lock(). On !SMP + !RT migrate_disable() is reduced to barrier(). This is not optimal because we few spots where a "preempt_disable()" statement was replaced with "migrate_disable()". We also used the migration_disable counter to figure out if a sleeping lock is acquired so RCU does not complain about schedule() during rcu_read_lock() while a sleeping lock is held. This changed, we no longer use it, we have now a sleeping_lock counter for the RCU purpose. This means we can now: - for SMP + RT_BASE full migration program, nothing changes here - for !SMP + RT_BASE the migration counting is no longer required. It used to ensure that the task is not migrated to another CPU and that this CPU remains online. !SMP ensures that already. Move it to CONFIG_SCHED_DEBUG so the counting is done for debugging purpose only. - for all other cases including !RT fallback to preempt_disable(). The only remaining users of migrate_disable() are those which were converted from preempt_disable() and the futex workaround which is already in the preempt_disable() section due to the spin_lock that is held. Cc: stable-rt@vger.kernel.org Reported-by: joe.korty@concurrent-rt.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 160/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: locking: don't check for __LINUX_SPINLOCK_TYPES_H on -RT archs Date: Fri, 4 Aug 2017 17:40:42 +0200 Upstream uses arch_spinlock_t within spinlock_t and requests that spinlock_types.h header file is included first. On -RT we have the rt_mutex with its raw_lock wait_lock which needs architectures' spinlock_types.h header file for its definition. However we need rt_mutex first because it is used to build the spinlock_t so that check does not work for us. Therefore I am dropping that check. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 161/258 [ Author: Peter Zijlstra Email: a.p.zijlstra@chello.nl Subject: rcu: Frob softirq test Date: Sat, 13 Aug 2011 00:23:17 +0200 With RT_FULL we get the below wreckage: [ 126.060484] ======================================================= [ 126.060486] [ INFO: possible circular locking dependency detected ] [ 126.060489] 3.0.1-rt10+ #30 [ 126.060490] ------------------------------------------------------- [ 126.060492] irq/24-eth0/1235 is trying to acquire lock: [ 126.060495] (&(lock)->wait_lock#2){+.+...}, at: [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060503] [ 126.060504] but task is already holding lock: [ 126.060506] (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429 [ 126.060511] [ 126.060511] which lock already depends on the new lock. [ 126.060513] [ 126.060514] [ 126.060514] the existing dependency chain (in reverse order) is: [ 126.060516] [ 126.060516] -> #1 (&p->pi_lock){-...-.}: [ 126.060519] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060524] [<ffffffff8150291e>] _raw_spin_lock_irqsave+0x4b/0x85 [ 126.060527] [<ffffffff810b5aa4>] task_blocks_on_rt_mutex+0x36/0x20f [ 126.060531] [<ffffffff815019bb>] rt_mutex_slowlock+0xd1/0x15a [ 126.060534] [<ffffffff81501ae3>] rt_mutex_lock+0x2d/0x2f [ 126.060537] [<ffffffff810d9020>] rcu_boost+0xad/0xde [ 126.060541] [<ffffffff810d90ce>] rcu_boost_kthread+0x7d/0x9b [ 126.060544] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060547] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060551] [ 126.060552] -> #0 (&(lock)->wait_lock#2){+.+...}: [ 126.060555] [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816 [ 126.060558] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060561] [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73 [ 126.060564] [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060566] [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29 [ 126.060569] [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4 [ 126.060573] [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89 [ 126.060576] [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5 [ 126.060580] [<ffffffff8107511c>] try_to_wake_up+0x175/0x429 [ 126.060583] [<ffffffff81075425>] wake_up_process+0x15/0x17 [ 126.060585] [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26 [ 126.060590] [<ffffffff81081df9>] irq_exit+0x49/0x55 [ 126.060593] [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98 [ 126.060597] [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20 [ 126.060600] [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44 [ 126.060603] [<ffffffff810d582c>] irq_thread+0xde/0x1af [ 126.060606] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060608] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060611] [ 126.060612] other info that might help us debug this: [ 126.060614] [ 126.060615] Possible unsafe locking scenario: [ 126.060616] [ 126.060617] CPU0 CPU1 [ 126.060619] ---- ---- [ 126.060620] lock(&p->pi_lock); [ 126.060623] lock(&(lock)->wait_lock); [ 126.060625] lock(&p->pi_lock); [ 126.060627] lock(&(lock)->wait_lock); [ 126.060629] [ 126.060629] *** DEADLOCK *** [ 126.060630] [ 126.060632] 1 lock held by irq/24-eth0/1235: [ 126.060633] #0: (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429 [ 126.060638] [ 126.060638] stack backtrace: [ 126.060641] Pid: 1235, comm: irq/24-eth0 Not tainted 3.0.1-rt10+ #30 [ 126.060643] Call Trace: [ 126.060644] <IRQ> [<ffffffff810acbde>] print_circular_bug+0x289/0x29a [ 126.060651] [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816 [ 126.060655] [<ffffffff810ab3aa>] ? trace_hardirqs_off_caller+0x1f/0x99 [ 126.060658] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55 [ 126.060661] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060664] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55 [ 126.060668] [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73 [ 126.060671] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55 [ 126.060674] [<ffffffff810d9655>] ? rcu_report_qs_rsp+0x87/0x8c [ 126.060677] [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060680] [<ffffffff810d9ea3>] ? rcu_read_unlock_special+0x9b/0x1c4 [ 126.060683] [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29 [ 126.060687] [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4 [ 126.060690] [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89 [ 126.060693] [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5 [ 126.060696] [<ffffffff810683da>] ? select_task_rq_rt+0x27/0xd5 [ 126.060701] [<ffffffff810a852a>] ? clockevents_program_event+0x8e/0x90 [ 126.060704] [<ffffffff8107511c>] try_to_wake_up+0x175/0x429 [ 126.060708] [<ffffffff810a95dc>] ? tick_program_event+0x1f/0x21 [ 126.060711] [<ffffffff81075425>] wake_up_process+0x15/0x17 [ 126.060715] [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26 [ 126.060718] [<ffffffff81081df9>] irq_exit+0x49/0x55 [ 126.060721] [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98 [ 126.060724] [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20 [ 126.060726] <EOI> [<ffffffff81072855>] ? migrate_disable+0x75/0x12d [ 126.060733] [<ffffffff81080a61>] ? local_bh_disable+0xe/0x1f [ 126.060736] [<ffffffff81080a70>] ? local_bh_disable+0x1d/0x1f [ 126.060739] [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44 [ 126.060742] [<ffffffff81502ac0>] ? _raw_spin_unlock_irq+0x3b/0x59 [ 126.060745] [<ffffffff810d582c>] irq_thread+0xde/0x1af [ 126.060748] [<ffffffff810d5937>] ? irq_thread_fn+0x3a/0x3a [ 126.060751] [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1 [ 126.060754] [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1 [ 126.060757] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060761] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060764] [<ffffffff81069ed7>] ? finish_task_switch+0x87/0x10a [ 126.060768] [<ffffffff81502ec4>] ? retint_restore_args+0xe/0xe [ 126.060771] [<ffffffff8109a6c7>] ? __init_kthread_worker+0x8c/0x8c [ 126.060774] [<ffffffff81509b10>] ? gs_change+0xb/0xb Because irq_exit() does: void irq_exit(void) { account_system_vtime(current); trace_hardirq_exit(); sub_preempt_count(IRQ_EXIT_OFFSET); if (!in_interrupt() && local_softirq_pending()) invoke_softirq(); ... } Which triggers a wakeup, which uses RCU, now if the interrupted task has t->rcu_read_unlock_special set, the rcu usage from the wakeup will end up in rcu_read_unlock_special(). rcu_read_unlock_special() will test for in_irq(), which will fail as we just decremented preempt_count with IRQ_EXIT_OFFSET, and in_sering_softirq(), which for PREEMPT_RT_FULL reads: int in_serving_softirq(void) { int res; preempt_disable(); res = __get_cpu_var(local_softirq_runner) == current; preempt_enable(); return res; } Which will thus also fail, resulting in the above wreckage. The 'somewhat' ugly solution is to open-code the preempt_count() test in rcu_read_unlock_special(). Also, we're not at all sure how ->rcu_read_unlock_special gets set here... so this is very likely a bandaid and more thought is required. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> ] 162/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rcu: Merge RCU-bh into RCU-preempt Date: Wed, 5 Oct 2011 11:59:38 -0700 The Linux kernel has long RCU-bh read-side critical sections that intolerably increase scheduling latency under mainline's RCU-bh rules, which include RCU-bh read-side critical sections being non-preemptible. This patch therefore arranges for RCU-bh to be implemented in terms of RCU-preempt for CONFIG_PREEMPT_RT_FULL=y. This has the downside of defeating the purpose of RCU-bh, namely, handling the case where the system is subjected to a network-based denial-of-service attack that keeps at least one CPU doing full-time softirq processing. This issue will be fixed by a later commit. The current commit will need some work to make it appropriate for mainline use, for example, it needs to be extended to cover Tiny RCU. [ paulmck: Added a useful changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20111005185938.GA20403@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 163/258 [ Author: "Paul E. McKenney" Email: paulmck@linux.vnet.ibm.com Subject: rcu: Make ksoftirqd do RCU quiescent states Date: Wed, 5 Oct 2011 11:45:18 -0700 Implementing RCU-bh in terms of RCU-preempt makes the system vulnerable to network-based denial-of-service attacks. This patch therefore makes __do_softirq() invoke rcu_bh_qs(), but only when __do_softirq() is running in ksoftirqd context. A wrapper layer in interposed so that other calls to __do_softirq() avoid invoking rcu_bh_qs(). The underlying function __do_softirq_common() does the actual work. The reason that rcu_bh_qs() is bad in these non-ksoftirqd contexts is that there might be a local_bh_enable() inside an RCU-preempt read-side critical section. This local_bh_enable() can invoke __do_softirq() directly, so if __do_softirq() were to invoke rcu_bh_qs() (which just calls rcu_preempt_qs() in the PREEMPT_RT_FULL case), there would be an illegal RCU-preempt quiescent state in the middle of an RCU-preempt read-side critical section. Therefore, quiescent states can only happen in cases where __do_softirq() is invoked directly from ksoftirqd. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20111005184518.GA21601@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 164/258 [ Author: "Paul E. McKenney" Email: paulmck@linux.vnet.ibm.com Subject: rcu: Eliminate softirq processing from rcutree Date: Mon, 4 Nov 2013 13:21:10 -0800 Running RCU out of softirq is a problem for some workloads that would like to manage RCU core processing independently of other softirq work, for example, setting kthread priority. This commit therefore moves the RCU core work from softirq to a per-CPU/per-flavor SCHED_OTHER kthread named rcuc. The SCHED_OTHER approach avoids the scalability problems that appeared with the earlier attempt to move RCU core processing to from softirq to kthreads. That said, kernels built with RCU_BOOST=y will run the rcuc kthreads at the RCU-boosting priority. Reported-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 165/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: srcu: use cpu_online() instead custom check Date: Wed, 13 Sep 2017 14:43:41 +0200 The current check via srcu_online is slightly racy because after looking at srcu_online there could be an interrupt that interrupted us long enough until the CPU we checked against went offline. An alternative would be to hold the hotplug rwsem (so the CPUs don't change their state) and then check based on cpu_online() if we queue it on a specific CPU or not. queue_work_on() itself can handle if something is enqueued on an offline CPU but a timer which is enqueued on an offline CPU won't fire until the CPU is back online. I am not sure if the removal in rcu_init() is okay or not. I assume that SRCU won't enqueue a work item before SRCU is up and ready. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 166/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: srcu: replace local_irqsave() with a locallock Date: Thu, 12 Oct 2017 18:37:12 +0200 There are two instances which disable interrupts in order to become a stable this_cpu_ptr() pointer. The restore part is coupled with spin_unlock_irqrestore() which does not work on RT. Replace the local_irq_save() call with the appropriate local_lock() version of it. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 167/258 [ Author: Julia Cartwright Email: julia@ni.com Subject: rcu: enable rcu_normal_after_boot by default for RT Date: Wed, 12 Oct 2016 11:21:14 -0500 The forcing of an expedited grace period is an expensive and very RT-application unfriendly operation, as it forcibly preempts all running tasks on CPUs which are preventing the gp from expiring. By default, as a policy decision, disable the expediting of grace periods (after boot) on configurations which enable PREEMPT_RT_FULL. Suggested-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Julia Cartwright <julia@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 168/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tty/serial/omap: Make the locking RT aware Date: Thu, 28 Jul 2011 13:32:57 +0200 The lock is a sleeping lock and local_irq_save() is not the optimsation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 169/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: tty/serial/pl011: Make the locking work on RT Date: Tue, 8 Jan 2013 21:36:51 +0100 The lock is a sleeping lock and local_irq_save() is not the optimsation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 170/258 [ Author: Ingo Molnar Email: mingo@elte.hu Subject: rt: Improve the serial console PASS_LIMIT Date: Wed, 14 Dec 2011 13:05:54 +0100 Beyond the warning: drivers/tty/serial/8250/8250.c:1613:6: warning: unused variable ‘pass_counter’ [-Wunused-variable] the solution of just looping infinitely was ugly - up it to 1 million to give it a chance to continue in some really ugly situation. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 171/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: tty: serial: 8250: don't take the trylock during oops Date: Mon, 11 Apr 2016 16:55:02 +0200 An oops with irqs off (panic() from irqsafe hrtimer like the watchdog timer) will lead to a lockdep warning on each invocation and as such never completes. Therefore we skip the trylock in the oops case. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 172/258 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: locking/percpu-rwsem: Remove preempt_disable variants Date: Wed, 23 Nov 2016 16:29:32 +0100 Effective revert commit: 87709e28dc7c ("fs/locks: Use percpu_down_read_preempt_disable()") This is causing major pain for PREEMPT_RT and is only a very small performance issue for PREEMPT=y. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> ] 173/258 [ Author: Yong Zhang Email: yong.zhang0@gmail.com Subject: mm: Protect activate_mm() by preempt_[disable&enable]_rt() Date: Tue, 15 May 2012 13:53:56 +0800 User preempt_*_rt instead of local_irq_*_rt or otherwise there will be warning on ARM like below: WARNING: at build/linux/kernel/smp.c:459 smp_call_function_many+0x98/0x264() Modules linked in: [<c0013bb4>] (unwind_backtrace+0x0/0xe4) from [<c001be94>] (warn_slowpath_common+0x4c/0x64) [<c001be94>] (warn_slowpath_common+0x4c/0x64) from [<c001bec4>] (warn_slowpath_null+0x18/0x1c) [<c001bec4>] (warn_slowpath_null+0x18/0x1c) from [<c0053ff8>](smp_call_function_many+0x98/0x264) [<c0053ff8>] (smp_call_function_many+0x98/0x264) from [<c0054364>] (smp_call_function+0x44/0x6c) [<c0054364>] (smp_call_function+0x44/0x6c) from [<c0017d50>] (__new_context+0xbc/0x124) [<c0017d50>] (__new_context+0xbc/0x124) from [<c009e49c>] (flush_old_exec+0x460/0x5e4) [<c009e49c>] (flush_old_exec+0x460/0x5e4) from [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) from [<c009d060>] (search_binary_handler+0x94/0x2a4) [<c009d060>] (search_binary_handler+0x94/0x2a4) from [<c009e8fc>] (do_execve+0x254/0x364) [<c009e8fc>] (do_execve+0x254/0x364) from [<c0010e84>] (sys_execve+0x34/0x54) [<c0010e84>] (sys_execve+0x34/0x54) from [<c000da00>] (ret_fast_syscall+0x0/0x30) ---[ end trace 0000000000000002 ]--- The reason is that ARM need irq enabled when doing activate_mm(). According to mm-protect-activate-switch-mm.patch, actually preempt_[disable|enable]_rt() is sufficient. Inspired-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1337061236-1766-1-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 174/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: bring back explicit INIT_HLIST_BL_HEAD init Date: Wed, 13 Sep 2017 12:32:34 +0200 Commit 3d375d78593c ("mm: update callers to use HASH_ZERO flag") removed INIT_HLIST_BL_HEAD and uses the ZERO flag instead for the init. However on RT we have also a spinlock which needs an init call so we can't use that. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 175/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: disable preemption on i_dir_seq's write side Date: Fri, 20 Oct 2017 11:29:53 +0200 i_dir_seq is an opencoded seqcounter. Based on the code it looks like we could have two writers in parallel despite the fact that the d_lock is held. The problem is that during the write process on RT the preemption is still enabled and if this process is interrupted by a reader with RT priority then we lock up. To avoid that lock up I am disabling the preemption during the update. The rename of i_dir_seq is here to ensure to catch new write sides in future. Cc: stable-rt@vger.kernel.org Reported-by: Oleg.Karfich@wago.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 176/258 [ Author: Julia Cartwright Email: julia@ni.com Subject: squashfs: make use of local lock in multi_cpu decompressor Date: Mon, 7 May 2018 08:58:57 -0500 Currently, the squashfs multi_cpu decompressor makes use of get_cpu_ptr()/put_cpu_ptr(), which unconditionally disable preemption during decompression. Because the workload is distributed across CPUs, all CPUs can observe a very high wakeup latency, which has been seen to be as much as 8000us. Convert this decompressor to make use of a local lock, which will allow execution of the decompressor with preemption-enabled, but also ensure concurrent accesses to the percpu compressor data on the local CPU will be serialized. Cc: stable-rt@vger.kernel.org Reported-by: Alexander Stein <alexander.stein@systec-electronic.com> Tested-by: Alexander Stein <alexander.stein@systec-electronic.com> Signed-off-by: Julia Cartwright <julia@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 177/258 [ Author: Daniel Wagner Email: wagi@monom.org Subject: thermal: Defer thermal wakups to threads Date: Tue, 17 Feb 2015 09:37:44 +0100 On RT the spin lock in pkg_temp_thermal_platfrom_thermal_notify will call schedule while we run in irq context. [<ffffffff816850ac>] dump_stack+0x4e/0x8f [<ffffffff81680f7d>] __schedule_bug+0xa6/0xb4 [<ffffffff816896b4>] __schedule+0x5b4/0x700 [<ffffffff8168982a>] schedule+0x2a/0x90 [<ffffffff8168a8b5>] rt_spin_lock_slowlock+0xe5/0x2d0 [<ffffffff8168afd5>] rt_spin_lock+0x25/0x30 [<ffffffffa03a7b75>] pkg_temp_thermal_platform_thermal_notify+0x45/0x134 [x86_pkg_temp_thermal] [<ffffffff8103d4db>] ? therm_throt_process+0x1b/0x160 [<ffffffff8103d831>] intel_thermal_interrupt+0x211/0x250 [<ffffffff8103d8c1>] smp_thermal_interrupt+0x21/0x40 [<ffffffff8169415d>] thermal_interrupt+0x6d/0x80 Let's defer the work to a kthread. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> [bigeasy: reoder init/denit position. TODO: flush swork on exit] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 178/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: fs/epoll: Do not disable preemption on RT Date: Fri, 8 Jul 2011 16:35:35 +0200 ep_call_nested() takes a sleeping lock so we can't disable preemption. The light version is enough since ep_call_nested() doesn't mind beeing invoked twice on the same CPU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 179/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: mm/vmalloc: Another preempt disable region which sucks Date: Tue, 12 Jul 2011 11:39:36 +0200 Avoid the preempt disable version of get_cpu_var(). The inner-lock should provide enough serialisation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 180/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block: mq: use cpu_light() Date: Wed, 9 Apr 2014 10:37:23 +0200 there is a might sleep splat because get_cpu() disables preemption and later we grab a lock. As a workaround for this we use get_cpu_light(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 181/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block/mq: do not invoke preempt_disable() Date: Tue, 14 Jul 2015 14:26:34 +0200 preempt_disable() and get_cpu() don't play well together with the sleeping locks it tries to allocate later. It seems to be enough to replace it with get_cpu_light() and migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 182/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block/mq: don't complete requests via IPI Date: Thu, 29 Jan 2015 15:10:08 +0100 The IPI runs in hardirq context and there are sleeping locks. This patch moves the completion into a workqueue. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 183/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: md: raid5: Make raid5_percpu handling RT aware Date: Tue, 6 Apr 2010 16:51:31 +0200 __raid_run_ops() disables preemption with get_cpu() around the access to the raid5_percpu variables. That causes scheduling while atomic spews on RT. Serialize the access to the percpu data with a lock and keep the code preemptible. Reported-by: Udo van den Heuvel <udovdh@xs4all.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Udo van den Heuvel <udovdh@xs4all.nl> ] 184/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: rt: Introduce cpu_chill() Date: Wed, 7 Mar 2012 20:51:03 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill() defaults to cpu_relax() for non RT. On RT it puts the looping task to sleep for a tick so the preempted task can make progress. Steven Rostedt changed it to use a hrtimer instead of msleep(): | |Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken |up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is |called from softirq context, it may block the ksoftirqd() from running, in |which case, it may never wake up the msleep() causing the deadlock. + bigeasy later changed to schedule_hrtimeout() |If a task calls cpu_chill() and gets woken up by a regular or spurious |wakeup and has a signal pending, then it exits the sleep loop in |do_nanosleep() and sets up the restart block. If restart->nanosleep.type is |not TI_NONE then this results in accessing a stale user pointer from a |previously interrupted syscall and a copy to user based on the stale |pointer or a BUG() when 'type' is not supported in nanosleep_copyout(). + bigeasy: add PF_NOFREEZE: | [....] Waiting for /dev to be fully populated... | ===================================== | [ BUG: udevd/229 still has locks held! ] | 3.12.11-rt17 #23 Not tainted | ------------------------------------- | 1 lock held by udevd/229: | #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98 | | stack backtrace: | CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23 | (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) | (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc) | (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160) | (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110) | (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38) | (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec) | (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c) | (dput+0x74/0x15c) from (lookup_real+0x4c/0x50) | (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44) | (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98) | (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc) | (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60) | (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c) | (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c) | (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94) | (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30) | (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 185/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: block: blk-mq: move blk_queue_usage_counter_release() into process context Date: Tue, 13 Mar 2018 13:49:16 +0100 | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:914 | in_atomic(): 1, irqs_disabled(): 0, pid: 255, name: kworker/u257:6 | 5 locks held by kworker/u257:6/255: | #0: ("events_unbound"){.+.+.+}, at: [<ffffffff8108edf1>] process_one_work+0x171/0x5e0 | #1: ((&entry->work)){+.+.+.}, at: [<ffffffff8108edf1>] process_one_work+0x171/0x5e0 | #2: (&shost->scan_mutex){+.+.+.}, at: [<ffffffffa000faa3>] __scsi_add_device+0xa3/0x130 [scsi_mod] | #3: (&set->tag_list_lock){+.+...}, at: [<ffffffff812f09fa>] blk_mq_init_queue+0x96a/0xa50 | #4: (rcu_read_lock_sched){......}, at: [<ffffffff8132887d>] percpu_ref_kill_and_confirm+0x1d/0x120 | Preemption disabled at:[<ffffffff812eff76>] blk_mq_freeze_queue_start+0x56/0x70 | | CPU: 2 PID: 255 Comm: kworker/u257:6 Not tainted 3.18.7-rt0+ #1 | Workqueue: events_unbound async_run_entry_fn | 0000000000000003 ffff8800bc29f998 ffffffff815b3a12 0000000000000000 | 0000000000000000 ffff8800bc29f9b8 ffffffff8109aa16 ffff8800bc29fa28 | ffff8800bc5d1bc8 ffff8800bc29f9e8 ffffffff815b8dd4 ffff880000000000 | Call Trace: | [<ffffffff815b3a12>] dump_stack+0x4f/0x7c | [<ffffffff8109aa16>] __might_sleep+0x116/0x190 | [<ffffffff815b8dd4>] rt_spin_lock+0x24/0x60 | [<ffffffff810b6089>] __wake_up+0x29/0x60 | [<ffffffff812ee06e>] blk_mq_usage_counter_release+0x1e/0x20 | [<ffffffff81328966>] percpu_ref_kill_and_confirm+0x106/0x120 | [<ffffffff812eff76>] blk_mq_freeze_queue_start+0x56/0x70 | [<ffffffff812f0000>] blk_mq_update_tag_set_depth+0x40/0xd0 | [<ffffffff812f0a1c>] blk_mq_init_queue+0x98c/0xa50 | [<ffffffffa000dcf0>] scsi_mq_alloc_queue+0x20/0x60 [scsi_mod] | [<ffffffffa000ea35>] scsi_alloc_sdev+0x2f5/0x370 [scsi_mod] | [<ffffffffa000f494>] scsi_probe_and_add_lun+0x9e4/0xdd0 [scsi_mod] | [<ffffffffa000fb26>] __scsi_add_device+0x126/0x130 [scsi_mod] | [<ffffffffa013033f>] ata_scsi_scan_host+0xaf/0x200 [libata] | [<ffffffffa012b5b6>] async_port_probe+0x46/0x60 [libata] | [<ffffffff810978fb>] async_run_entry_fn+0x3b/0xf0 | [<ffffffff8108ee81>] process_one_work+0x201/0x5e0 percpu_ref_kill_and_confirm() invokes blk_mq_usage_counter_release() in a rcu-sched region. swait based wake queue can't be used due to wake_up_all() usage and disabled interrupts in !RT configs (as reported by Corey Minyard). The wq_has_sleeper() check has been suggested by Peter Zijlstra. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 186/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: block: Use cpu_chill() for retry loops Date: Thu, 20 Dec 2012 18:28:26 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Steven also observed a live lock when there was a concurrent priority boosting going on. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 187/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: fs: dcache: Use cpu_chill() in trylock loops Date: Wed, 7 Mar 2012 21:00:34 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 188/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Use cpu_chill() instead of cpu_relax() Date: Wed, 7 Mar 2012 21:10:04 +0100 Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 189/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: fs/dcache: use swait_queue instead of waitqueue Date: Wed, 14 Sep 2016 14:35:49 +0200 __d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 190/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: workqueue: Use normal rcu Date: Wed, 24 Jul 2013 15:26:54 +0200 There is no need for sched_rcu. The undocumented reason why sched_rcu is used is to avoid a few explicit rcu_read_lock()/unlock() pairs by abusing the fact that sched_rcu reader side critical sections are also protected by preempt or irq disabled regions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 191/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: workqueue: Use local irq lock instead of irq disable regions Date: Sun, 17 Jul 2011 21:42:26 +0200 Use a local_irq_lock as a replacement for irq off regions. We keep the semantic of irq-off in regard to the pool->lock and remain preemptible. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 192/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: workqueue: Prevent workqueue versus ata-piix livelock Date: Mon, 1 Jul 2013 11:02:42 +0200 An Intel i7 system regularly detected rcu_preempt stalls after the kernel was upgraded from 3.6-rt to 3.8-rt. When the stall happened, disk I/O was no longer possible, unless the system was restarted. The kernel message was: INFO: rcu_preempt self-detected stall on CPU { 6} [..] NMI backtrace for cpu 6 CPU 6 Pid: 119, comm: irq/19-ata_piix Not tainted 3.8.13-rt13 #11 Shuttle Inc. SX58/SX58 RIP: 0010:[<ffffffff8124ca60>] [<ffffffff8124ca60>] ip_compute_csum+0x30/0x30 RSP: 0018:ffff880333303cb0 EFLAGS: 00000002 RAX: 0000000000000006 RBX: 00000000000003e9 RCX: 0000000000000034 RDX: 0000000000000000 RSI: ffffffff81aa16d0 RDI: 0000000000000001 RBP: ffff880333303ce8 R08: ffffffff81aa16d0 R09: ffffffff81c1b8cc R10: 0000000000000000 R11: 0000000000000000 R12: 000000000005161f R13: 0000000000000006 R14: ffffffff81aa16d0 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff880333300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003c1b2bb420 CR3: 0000000001a0f000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process irq/19-ata_piix (pid: 119, threadinfo ffff88032d88a000, task ffff88032df80000) Stack: ffffffff8124cb32 000000000005161e 00000000000003e9 0000000000001000 0000000000009022 ffffffff81aa16d0 0000000000000002 ffff880333303cf8 ffffffff8124caa9 ffff880333303d08 ffffffff8124cad2 ffff880333303d28 Call Trace: <IRQ> [<ffffffff8124cb32>] ? delay_tsc+0x33/0xe3 [<ffffffff8124caa9>] __delay+0xf/0x11 [<ffffffff8124cad2>] __const_udelay+0x27/0x29 [<ffffffff8102d1fa>] native_safe_apic_wait_icr_idle+0x39/0x45 [<ffffffff8102dc9b>] __default_send_IPI_dest_field.constprop.0+0x1e/0x58 [<ffffffff8102dd1e>] default_send_IPI_mask_sequence_phys+0x49/0x7d [<ffffffff81030326>] physflat_send_IPI_all+0x17/0x19 [<ffffffff8102de53>] arch_trigger_all_cpu_backtrace+0x50/0x79 [<ffffffff810b21d0>] rcu_check_callbacks+0x1cb/0x568 [<ffffffff81048c9c>] ? raise_softirq+0x2e/0x35 [<ffffffff81086be0>] ? tick_sched_do_timer+0x38/0x38 [<ffffffff8104f653>] update_process_times+0x44/0x55 [<ffffffff81086866>] tick_sched_handle+0x4a/0x59 [<ffffffff81086c1c>] tick_sched_timer+0x3c/0x5b [<ffffffff81062845>] __run_hrtimer+0x9b/0x158 [<ffffffff810631d8>] hrtimer_interrupt+0x172/0x2aa [<ffffffff8102d498>] smp_apic_timer_interrupt+0x76/0x89 [<ffffffff814d881d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff81057cd2>] ? __local_lock_irqsave+0x17/0x4a [<ffffffff81059336>] try_to_grab_pending+0x42/0x17e [<ffffffff8105a699>] mod_delayed_work_on+0x32/0x88 [<ffffffff8105a70b>] mod_delayed_work+0x1c/0x1e [<ffffffff8122ae84>] blk_run_queue_async+0x37/0x39 [<ffffffff81230985>] flush_end_io+0xf1/0x107 [<ffffffff8122e0da>] blk_finish_request+0x21e/0x264 [<ffffffff8122e162>] blk_end_bidi_request+0x42/0x60 [<ffffffff8122e1ba>] blk_end_request+0x10/0x12 [<ffffffff8132de46>] scsi_io_completion+0x1bf/0x492 [<ffffffff81335cec>] ? sd_done+0x298/0x2ef [<ffffffff81325a02>] scsi_finish_command+0xe9/0xf2 [<ffffffff8132dbcb>] scsi_softirq_done+0x106/0x10f [<ffffffff812333d3>] blk_done_softirq+0x77/0x87 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff810aa820>] ? irq_thread_fn+0x3a/0x3a [<ffffffff81048466>] local_bh_enable+0x43/0x72 [<ffffffff810aa866>] irq_forced_thread_fn+0x46/0x52 [<ffffffff810ab089>] irq_thread+0x8c/0x17c [<ffffffff810ab179>] ? irq_thread+0x17c/0x17c [<ffffffff810aaffd>] ? wake_threads_waitq+0x44/0x44 [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7b7c>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 The state of softirqd of this CPU at the time of the crash was: ksoftirqd/6 R running task 0 53 2 0x00000000 ffff88032fc39d18 0000000000000046 ffff88033330c4c0 ffff8803303f4710 ffff88032fc39fd8 ffff88032fc39fd8 0000000000000000 0000000000062500 ffff88032df88000 ffff8803303f4710 0000000000000000 ffff88032fc38000 Call Trace: [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff814d178c>] preempt_schedule+0x61/0x76 [<ffffffff8106cccf>] migrate_enable+0xe5/0x1df [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff8104ef52>] run_timer_softirq+0x161/0x1d6 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff8104840b>] run_ksoftirqd+0x2d/0x45 [<ffffffff8106658a>] smpboot_thread_fn+0x2ea/0x308 [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7afc>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 Apparently, the softirq demon and the ata_piix IRQ handler were waiting for each other to finish ending up in a livelock. After the below patch was applied, the system no longer crashes. Reported-by: Carsten Emde <C.Emde@osadl.org> Proposed-by: Thomas Gleixner <tglx@linutronix.de> Tested by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 193/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Distangle worker accounting from rqlock Date: Wed, 22 Jun 2011 19:47:03 +0200 The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 194/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: debugobjects: Make RT aware Date: Sun, 17 Jul 2011 21:41:35 +0200 Avoid filling the pool / allocating memory with irqs off(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 195/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: seqlock: Prevent rt starvation Date: Wed, 22 Feb 2012 12:03:30 +0100 If a low prio writer gets preempted while holding the seqlock write locked, a high prio reader spins forever on RT. To prevent this let the reader grab the spinlock, so it blocks and eventually boosts the writer. This way the writer can proceed and endless spinning is prevented. For seqcount writers we disable preemption over the update code path. Thanks to Al Viro for distangling some VFS code to make that possible. Nicholas Mc Guire: - spin_lock+unlock => spin_unlock_wait - __write_seqcount_begin => __raw_write_seqcount_begin Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 196/258 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light() Date: Wed, 18 Feb 2015 16:05:28 +0100 |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 |in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd |Preemption disabled at:[<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] |CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9 |Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014 | ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002 | 0000000000000000 ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008 | ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000 |Call Trace: | [<ffffffff815bdeb5>] dump_stack+0x4f/0x9e | [<ffffffff81073c86>] __might_sleep+0xe6/0x150 | [<ffffffff815c3d84>] rt_spin_lock+0x24/0x50 | [<ffffffffa06beec0>] svc_xprt_do_enqueue+0x80/0x230 [sunrpc] | [<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] | [<ffffffffa06c03ed>] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc] | [<ffffffffa06b2693>] svc_addsock+0x143/0x200 [sunrpc] | [<ffffffffa072e69c>] write_ports+0x28c/0x340 [nfsd] | [<ffffffffa072d2ac>] nfsctl_transaction_write+0x4c/0x80 [nfsd] | [<ffffffff8117ee83>] vfs_write+0xb3/0x1d0 | [<ffffffff8117f889>] SyS_write+0x49/0xb0 | [<ffffffff815c4556>] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 197/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Use skbufhead with raw lock Date: Tue, 12 Jul 2011 15:38:34 +0200 Use the rps lock as rawlock so we can keep irq-off regions. It looks low latency. However we can't kfree() from this context therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 198/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: move xmit_recursion to per-task variable on -RT Date: Wed, 13 Jan 2016 15:55:02 +0100 A softirq on -RT can be preempted. That means one task is in __dev_queue_xmit(), gets preempted and another task may enter __dev_queue_xmit() aw well. netperf together with a bridge device will then trigger the `recursion alert` because each task increments the xmit_recursion variable which is per-CPU. A virtual device like br0 is required to trigger this warning. This patch moves the lock owner and counter to be per task instead per-CPU so it counts the recursion properly on -RT. The owner is also a task now and not a CPU number. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 199/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: provide a way to delegate processing a softirq to ksoftirqd Date: Wed, 20 Jan 2016 15:39:05 +0100 If the NET_RX uses up all of his budget it moves the following NAPI invocations into the `ksoftirqd`. On -RT it does not do so. Instead it rises the NET_RX softirq in its current context again. In order to get closer to mainline's behaviour this patch provides __raise_softirq_irqoff_ksoft() which raises the softirq in the ksoftird. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 200/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: dev: always take qdisc's busylock in __dev_xmit_skb() Date: Wed, 30 Mar 2016 13:36:29 +0200 The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. If this task is now pushed away by a task with a higher priority then the task with the higher priority won't be able to submit packets to the NIC directly instead they will be enqueued into the Qdisc. The NIC will remain idle until the task(s) with higher priority leave the CPU and the task with lower priority gets back and finishes the job. If we take always the busylock we ensure that the RT task can boost the low-prio task and submit the packet. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 201/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/Qdisc: use a seqlock instead seqcount Date: Wed, 14 Sep 2016 17:36:35 +0200 The seqcount disables preemption on -RT while it is held which can't remove. Also we don't want the reader to spin for ages if the writer is scheduled out. The seqlock on the other hand will serialize / sleep on the lock while writer is active. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 202/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: add back the missing serialization in ip_send_unicast_reply() Date: Wed, 31 Aug 2016 17:21:56 +0200 Some time ago Sami Pietikäinen reported a crash on -RT in ip_send_unicast_reply() which was later fixed by Nicholas Mc Guire (v3.12.8-rt11). Later (v3.18.8) the code was reworked and I dropped the patch. As it turns out it was mistake. I have reports that the same crash is possible with a similar backtrace. It seems that vanilla protects access to this_cpu_ptr() via local_bh_disable(). This does not work the on -RT since we can have NET_RX and NET_TX running in parallel on the same CPU. This is brings back the old locks. |Unable to handle kernel NULL pointer dereference at virtual address 00000010 |PC is at __ip_make_skb+0x198/0x3e8 |[<c04e39d8>] (__ip_make_skb) from [<c04e3ca8>] (ip_push_pending_frames+0x20/0x40) |[<c04e3ca8>] (ip_push_pending_frames) from [<c04e3ff0>] (ip_send_unicast_reply+0x210/0x22c) |[<c04e3ff0>] (ip_send_unicast_reply) from [<c04fbb54>] (tcp_v4_send_reset+0x190/0x1c0) |[<c04fbb54>] (tcp_v4_send_reset) from [<c04fcc1c>] (tcp_v4_do_rcv+0x22c/0x288) |[<c04fcc1c>] (tcp_v4_do_rcv) from [<c0474364>] (release_sock+0xb4/0x150) |[<c0474364>] (release_sock) from [<c04ed904>] (tcp_close+0x240/0x454) |[<c04ed904>] (tcp_close) from [<c0511408>] (inet_release+0x74/0x7c) |[<c0511408>] (inet_release) from [<c0470728>] (sock_release+0x30/0xb0) |[<c0470728>] (sock_release) from [<c0470abc>] (sock_close+0x1c/0x24) |[<c0470abc>] (sock_close) from [<c0115ec4>] (__fput+0xe8/0x20c) |[<c0115ec4>] (__fput) from [<c0116050>] (____fput+0x18/0x1c) |[<c0116050>] (____fput) from [<c0058138>] (task_work_run+0xa4/0xb8) |[<c0058138>] (task_work_run) from [<c0011478>] (do_work_pending+0xd0/0xe4) |[<c0011478>] (do_work_pending) from [<c000e740>] (work_pending+0xc/0x20) |Code: e3530001 8a000001 e3a00040 ea000011 (e5973010) Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 203/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: add a lock around icmp_sk() Date: Wed, 31 Aug 2016 17:54:09 +0200 It looks like the this_cpu_ptr() access in icmp_sk() is protected with local_bh_disable(). To avoid missing serialization in -RT I am adding here a local lock. No crash has been observed, this is just precaution. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 204/258 [ Author: Steven Rostedt Email: rostedt@goodmis.org Subject: net: Have __napi_schedule_irqoff() disable interrupts on RT Date: Tue, 6 Dec 2016 17:50:30 -0500 A customer hit a crash where the napi sd->poll_list became corrupted. The customer had the bnx2x driver, which does a __napi_schedule_irqoff() in its interrupt handler. Unfortunately, when running with CONFIG_PREEMPT_RT_FULL, this interrupt handler is run as a thread and is preemptable. The call to ____napi_schedule() must be done with interrupts disabled to protect the per cpu softnet_data's "poll_list, which is protected by disabling interrupts (disabling preemption is enough when all interrupts are threaded and local_bh_disable() can't preempt)." As bnx2x isn't the only driver that does this, the safest thing to do is to make __napi_schedule_irqoff() call __napi_schedule() instead when CONFIG_PREEMPT_RT_FULL is enabled, which will call local_irq_save() before calling ____napi_schedule(). Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt (Red Hat) <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 205/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: irqwork: push most work into softirq context Date: Tue, 23 Jun 2015 15:32:51 +0200 Initially we defered all irqwork into softirq because we didn't want the latency spikes if perf or another user was busy and delayed the RT task. The NOHZ trigger (nohz_full_kick_work) was the first user that did not work as expected if it did not run in the original irqwork context so we had to bring it back somehow for it. push_irq_work_func is the second one that requires this. This patch adds the IRQ_WORK_HARD_IRQ which makes sure the callback runs in raw-irq context. Everything else is defered into softirq context. Without -RT we have the orignal behavior. This patch incorporates tglx orignal work which revoked a little bringing back the arch_irq_work_raise() if possible and a few fixes from Steven Rostedt and Mike Galbraith, [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 206/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: printk: Make rt aware Date: Wed, 19 Sep 2012 14:50:37 +0200 Drop the lock before calling the console driver and do not disable interrupts while printing to a serial console. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 207/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: kernel/printk: Don't try to print from IRQ/NMI region Date: Thu, 19 May 2016 17:45:27 +0200 On -RT we try to acquire sleeping locks which might lead to warnings from lockdep or a warn_on() from spin_try_lock() (which is a rtmutex on RT). We don't print in general from a IRQ off region so we should not try this via console_unblank() / bust_spinlocks() as well. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 208/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: printk: Drop the logbuf_lock more often Date: Thu, 21 Mar 2013 19:01:05 +0100 The lock is hold with irgs off. The latency drops 500us+ on my arm bugs with a "full" buffer after executing "dmesg" on the shell. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 209/258 [ Author: Paul Gortmaker Email: paul.gortmaker@windriver.com Subject: powerpc: ps3/device-init.c - adapt to completions using swait vs wait Date: Sun, 31 May 2015 14:44:42 -0400 To fix: cc1: warnings being treated as errors arch/powerpc/platforms/ps3/device-init.c: In function 'ps3_notification_read_write': arch/powerpc/platforms/ps3/device-init.c:755:2: error: passing argument 1 of 'prepare_to_wait_event' from incompatible pointer type arch/powerpc/platforms/ps3/device-init.c:755:2: error: passing argument 1 of 'abort_exclusive_wait' from incompatible pointer type arch/powerpc/platforms/ps3/device-init.c:755:2: error: passing argument 1 of 'finish_wait' from incompatible pointer type arch/powerpc/platforms/ps3/device-init.o] Error 1 make[3]: *** Waiting for unfinished jobs.... Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 210/258 [ Author: "Yadi.hu" Email: yadi.hu@windriver.com Subject: ARM: enable irq in translation/section permission fault handlers Date: Wed, 10 Dec 2014 10:32:09 +0800 Probably happens on all ARM, with CONFIG_PREEMPT_RT_FULL CONFIG_DEBUG_ATOMIC_SLEEP This simple program.... int main() { *((char*)0xc0001000) = 0; }; [ 512.742724] BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 [ 512.743000] in_atomic(): 0, irqs_disabled(): 128, pid: 994, name: a [ 512.743217] INFO: lockdep is turned off. [ 512.743360] irq event stamp: 0 [ 512.743482] hardirqs last enabled at (0): [< (null)>] (null) [ 512.743714] hardirqs last disabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744013] softirqs last enabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744303] softirqs last disabled at (0): [< (null)>] (null) [ 512.744631] [<c041872c>] (unwind_backtrace+0x0/0x104) [ 512.745001] [<c09af0c4>] (dump_stack+0x20/0x24) [ 512.745355] [<c0462490>] (__might_sleep+0x1dc/0x1e0) [ 512.745717] [<c09b6770>] (rt_spin_lock+0x34/0x6c) [ 512.746073] [<c0441bf0>] (do_force_sig_info+0x34/0xf0) [ 512.746457] [<c0442668>] (force_sig_info+0x18/0x1c) [ 512.746829] [<c041d880>] (__do_user_fault+0x9c/0xd8) [ 512.747185] [<c041d938>] (do_bad_area+0x7c/0x94) [ 512.747536] [<c041d990>] (do_sect_fault+0x40/0x48) [ 512.747898] [<c040841c>] (do_DataAbort+0x40/0xa0) [ 512.748181] Exception stack(0xecaa1fb0 to 0xecaa1ff8) Oxc0000000 belongs to kernel address space, user task can not be allowed to access it. For above condition, correct result is that test case should receive a “segment fault” and exits but not stacks. the root cause is commit 02fe2845d6a8 ("avoid enabling interrupts in prefetch/data abort handlers"),it deletes irq enable block in Data abort assemble code and move them into page/breakpiont/alignment fault handlers instead. But author does not enable irq in translation/section permission fault handlers. ARM disables irq when it enters exception/ interrupt mode, if kernel doesn't enable irq, it would be still disabled during translation/section permission fault. We see the above splat because do_force_sig_info is still called with IRQs off, and that code eventually does a: spin_lock_irqsave(&t->sighand->siglock, flags); As this is architecture independent code, and we've not seen any other need for other arch to have the siglock converted to raw lock, we can conclude that we should enable irq for ARM translation/section permission exception. Signed-off-by: Yadi.hu <yadi.hu@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 211/258 [ Author: Josh Cartwright Email: joshc@ni.com Subject: genirq: update irq_set_irqchip_state documentation Date: Thu, 11 Feb 2016 11:54:00 -0600 On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 212/258 [ Author: Josh Cartwright Email: joshc@ni.com Subject: KVM: arm/arm64: downgrade preempt_disable()d region to migrate_disable() Date: Thu, 11 Feb 2016 11:54:01 -0600 kvm_arch_vcpu_ioctl_run() disables the use of preemption when updating the vgic and timer states to prevent the calling task from migrating to another CPU. It does so to prevent the task from writing to the incorrect per-CPU GIC distributor registers. On -rt kernels, it's possible to maintain the same guarantee with the use of migrate_{disable,enable}(), with the added benefit that the migrate-disabled region is preemptible. Update kvm_arch_vcpu_ioctl_run() to do so. Cc: Christoffer Dall <christoffer.dall@linaro.org> Reported-by: Manish Jaggi <Manish.Jaggi@caviumnetworks.com> Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 213/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm64: fpsimd: use preemp_disable in addition to local_bh_disable() Date: Wed, 25 Jul 2018 14:02:38 +0200 In v4.16-RT I noticed a number of warnings from task_fpsimd_load(). The code disables BH and expects that it is not preemptible. On -RT the task remains preemptible but remains the same CPU. This may corrupt the content of the SIMD registers if the task is preempted during saving/restoring those registers. Add preempt_disable()/enable() to enfore the required semantic on -RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 214/258 [ Author: Jason Wessel Email: jason.wessel@windriver.com Subject: kgdb/serial: Short term workaround Date: Thu, 28 Jul 2011 12:42:23 -0500 On 07/27/2011 04:37 PM, Thomas Gleixner wrote: > - KGDB (not yet disabled) is reportedly unusable on -rt right now due > to missing hacks in the console locking which I dropped on purpose. > To work around this in the short term you can use this patch, in addition to the clocksource watchdog patch that Thomas brewed up. Comments are welcome of course. Ultimately the right solution is to change separation between the console and the HW to have a polled mode + work queue so as not to introduce any kind of latency. Thanks, Jason. ] 215/258 [ Author: Clark Williams Email: williams@redhat.com Subject: sysfs: Add /sys/kernel/realtime entry Date: Sat, 30 Jul 2011 21:55:53 -0500 Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules, udev needs to evaluate if its a PREEMPT_RT kernel a few thousand times and parsing uname output is too slow or so. Are there better solutions? Should it exist and return 0 on !-rt? Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> ] 216/258 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: mm, rt: kmap_atomic scheduling Date: Thu, 28 Jul 2011 10:43:51 +0200 In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ] ] 217/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: x86/highmem: Add a "already used pte" check Date: Mon, 11 Mar 2013 17:09:55 +0100 This is a copy from kmap_atomic_prot(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 218/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: arm/highmem: Flush tlb on unmap Date: Mon, 11 Mar 2013 21:37:27 +0100 The tlb should be flushed on unmap and thus make the mapping entry invalid. This is only done in the non-debug case which does not look right. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 219/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: arm: Enable highmem for rt Date: Wed, 13 Feb 2013 11:03:11 +0100 fixup highmem for ARM. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 220/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: scsi/fcoe: Make RT aware. Date: Sat, 12 Nov 2011 14:00:48 +0100 Do not disable preemption while taking sleeping locks. All user look safe for migrate_diable() only. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 221/258 [ Author: Peter Zijlstra Email: peterz@infradead.org Subject: x86: crypto: Reduce preempt disabled regions Date: Mon, 14 Nov 2011 18:19:27 +0100 Restrict the preempt disabled regions to the actual floating point operations and enable preemption for the administrative actions. This is necessary on RT to avoid that kfree and other operations are called with preemption disabled. Reported-and-tested-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 222/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: Reduce preempt disabled regions, more algos Date: Fri, 21 Feb 2014 17:24:04 +0100 Don Estabrook reported | kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() | kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2462 migrate_enable+0x17b/0x200() | kernel: WARNING: CPU: 3 PID: 865 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() and his backtrace showed some crypto functions which looked fine. The problem is the following sequence: glue_xts_crypt_128bit() { blkcipher_walk_virt(); /* normal migrate_disable() */ glue_fpu_begin(); /* get atomic */ while (nbytes) { __glue_xts_crypt_128bit(); blkcipher_walk_done(); /* with nbytes = 0, migrate_enable() * while we are atomic */ }; glue_fpu_end() /* no longer atomic */ } and this is why the counter get out of sync and the warning is printed. The other problem is that we are non-preemptible between glue_fpu_begin() and glue_fpu_end() and the latency grows. To fix this, I shorten the FPU off region and ensure blkcipher_walk_done() is called with preemption enabled. This might hurt the performance because we now enable/disable the FPU state more often but we gain lower latency and the bug is gone. Reported-by: Don Estabrook <don.estabrook@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 223/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: limit more FPU-enabled sections Date: Thu, 30 Nov 2017 13:40:10 +0100 Those crypto drivers use SSE/AVX/… for their crypto work and in order to do so in kernel they need to enable the "FPU" in kernel mode which disables preemption. There are two problems with the way they are used: - the while loop which processes X bytes may create latency spikes and should be avoided or limited. - the cipher-walk-next part may allocate/free memory and may use kmap_atomic(). The whole kernel_fpu_begin()/end() processing isn't probably that cheap. It most likely makes sense to process as much of those as possible in one go. The new *_fpu_sched_rt() schedules only if a RT task is pending. Probably we should measure the performance those ciphers in pure SW mode and with this optimisations to see if it makes sense to keep them for RT. This kernel_fpu_resched() makes the code more preemptible which might hurt performance. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 224/258 [ Author: Mike Galbraith Email: efault@gmx.de Subject: crypto: scompress - serialize RT percpu scratch buffer access with a local lock Date: Wed, 11 Jul 2018 17:14:47 +0200 | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:974 | in_atomic(): 1, irqs_disabled(): 0, pid: 1401, name: cryptomgr_test | Preemption disabled at: | [<ffff00000849941c>] scomp_acomp_comp_decomp+0x34/0x1a0 | CPU: 21 PID: 1401 Comm: cryptomgr_test Tainted: G W 4.16.18-rt9-rt #1 | Hardware name: www.cavium.com crb-1s/crb-1s, BIOS 0.3 Apr 25 2017 | Call trace: | dump_backtrace+0x0/0x1c8 | show_stack+0x24/0x30 | dump_stack+0xac/0xe8 | ___might_sleep+0x124/0x188 | rt_spin_lock+0x40/0x88 | zip_load_instr+0x44/0x170 [thunderx_zip] | zip_deflate+0x184/0x378 [thunderx_zip] | zip_compress+0xb0/0x130 [thunderx_zip] | zip_scomp_compress+0x48/0x60 [thunderx_zip] | scomp_acomp_comp_decomp+0xd8/0x1a0 | scomp_acomp_compress+0x24/0x30 | test_acomp+0x15c/0x558 | alg_test_comp+0xc0/0x128 | alg_test.part.6+0x120/0x2c0 | alg_test+0x6c/0xa0 | cryptomgr_test+0x50/0x58 | kthread+0x134/0x138 | ret_from_fork+0x10/0x18 Mainline disables preemption to serialize percpu scratch buffer access, causing the splat above. Serialize with a local lock for RT instead. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 225/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: crypto: cryptd - add a lock instead preempt_disable/local_bh_disable Date: Thu, 26 Jul 2018 18:52:00 +0200 cryptd has a per-CPU lock which protected with local_bh_disable() and preempt_disable(). Add an explicit spin_lock to make the locking context more obvious and visible to lockdep. Since it is a per-CPU lock, there should be no lock contention on the actual spinlock. There is a small race-window where we could be migrated to another CPU after the cpu_queue has been obtain. This is not a problem because the actual ressource is protected by the spinlock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 226/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: panic: skip get_random_bytes for RT_FULL in init_oops_id Date: Tue, 14 Jul 2015 14:26:34 +0200 Disable on -RT. If this is invoked from irq-context we will have problems to acquire the sleeping lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 227/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: stackprotector: Avoid random pool on rt Date: Thu, 16 Dec 2010 14:25:18 +0100 CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT as the might sleep checks are disabled. During CPU hotplug the might sleep checks trigger. Making the locks in random raw is a major PITA, so avoid the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomnness. Reported-by: Carsten Emde <carsten.emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 228/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: random: Make it work on rt Date: Tue, 21 Aug 2012 20:38:50 +0200 Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 229/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: random: avoid preempt_disable()ed section Date: Fri, 12 May 2017 15:46:17 +0200 extract_crng() will use sleeping locks while in a preempt_disable() section due to get_cpu_var(). Work around it with local_locks. Cc: stable-rt@vger.kernel.org # where it applies to Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 230/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: cpu/hotplug: Implement CPU pinning Date: Wed, 19 Jul 2017 17:31:20 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 231/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: hotplug: duct-tape RT-rwlock usage for non-RT Date: Fri, 4 Aug 2017 18:31:00 +0200 This type is only available on -RT. We need to craft something for non-RT. Since the only migrate_disable() user is -RT only, there is no damage. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 232/258 [ Author: Priyanka Jain Email: Priyanka.Jain@freescale.com Subject: net: Remove preemption disabling in netif_rx() Date: Thu, 17 May 2012 09:35:11 +0530 1)enqueue_to_backlog() (called from netif_rx) should be bind to a particluar CPU. This can be achieved by disabling migration. No need to disable preemption 2)Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backog() is called in atomic context. And if backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might gets scheduled out, so it expects non atomic context. 3)When CONFIG_PREEMPT_RT_FULL is not defined, migrate_enable(), migrate_disable() maps to preempt_enable() and preempt_disable(), so no change in functionality in case of non-RT. -Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively -Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com> Acked-by: Rajan Srivastava <Rajan.Srivastava@freescale.com> Cc: <rostedt@goodmis.orgn> Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 233/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: Another local_irq_disable/kmalloc headache Date: Wed, 26 Sep 2012 16:21:08 +0200 Replace it by a local lock. Though that's pretty inefficient :( Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 234/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net/core: protect users of napi_alloc_cache against reentrance Date: Fri, 15 Jan 2016 16:33:34 +0100 On -RT the code running in BH can not be moved to another CPU so CPU local variable remain local. However the code can be preempted and another task may enter BH accessing the same CPU using the same napi_alloc_cache variable. This patch ensures that each user of napi_alloc_cache uses a local lock. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 235/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: net: netfilter: Serialize xt_write_recseq sections on RT Date: Sun, 28 Oct 2012 11:18:08 +0100 The netfilter code relies only on the implicit semantics of local_bh_disable() for serializing wt_write_recseq sections. RT breaks that and needs explicit serialization here. Reported-by: Peter LaDow <petela@gocougs.wsu.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 236/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: net: Add a mutex around devnet_rename_seq Date: Wed, 20 Mar 2013 18:06:20 +0100 On RT write_seqcount_begin() disables preemption and device_rename() allocates memory with GFP_KERNEL and grabs later the sysfs_mutex mutex. Serialize with a mutex and add use the non preemption disabling __write_seqcount_begin(). To avoid writer starvation, let the reader grab the mutex and release it when it detects a writer in progress. This keeps the normal case (no reader on the fly) fast. [ tglx: Instead of replacing the seqcount by a mutex, add the mutex ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 237/258 [ Author: Yong Zhang Email: yong.zhang@windriver.com Subject: lockdep: selftest: Only do hardirq context test for raw spinlock Date: Mon, 16 Apr 2012 15:01:56 +0800 On -rt there is no softirq context any more and rwlock is sleepable, disable softirq context test and rwlock+irq test. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Yong Zhang <yong.zhang@windriver.com> Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 238/258 [ Author: Josh Cartwright Email: josh.cartwright@ni.com Subject: lockdep: selftest: fix warnings due to missing PREEMPT_RT conditionals Date: Wed, 28 Jan 2015 13:08:45 -0600 "lockdep: Selftest: Only do hardirq context test for raw spinlock" disabled the execution of certain tests with PREEMPT_RT_FULL, but did not prevent the tests from still being defined. This leads to warnings like: ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:580:1: warning: 'irqsafe1_soft_spin_12' defined but not used [-Wunused-function] ... Fixed by wrapping the test definitions in #ifndef CONFIG_PREEMPT_RT_FULL conditionals. Signed-off-by: Josh Cartwright <josh.cartwright@ni.com> Signed-off-by: Xander Huff <xander.huff@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 239/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: sched: Add support for lazy preemption Date: Fri, 26 Oct 2012 18:50:54 +0100 It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside of the existing preempt_count each tasks sports now a preempt_lazy_count which is manipulated on lock acquiry and release. This is slightly incorrect as for lazyness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefor allows to exit the waking task the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial test do not expose any observable latency increasement, but history shows that I've been proven wrong before :) The lazy preemption mode is per default on, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and reenabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 240/258 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: ftrace: Fix trace header alignment Date: Sun, 16 Oct 2016 05:08:30 +0200 Line up helper arrows to the right column. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [bigeasy: fixup function tracer header] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 241/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: x86: Support for lazy preemption Date: Thu, 1 Nov 2012 11:03:47 +0100 Implement the x86 pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 242/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: arm: Add support for lazy preemption Date: Wed, 31 Oct 2012 12:04:11 +0100 Implement the arm pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 243/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: powerpc: Add support for lazy preemption Date: Thu, 1 Nov 2012 10:14:11 +0100 Implement the powerpc pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 244/258 [ Author: Anders Roxell Email: anders.roxell@linaro.org Subject: arch/arm64: Add lazy preempt support Date: Thu, 14 May 2015 17:52:17 +0200 arm64 is missing support for PREEMPT_RT. The main feature which is lacking is support for lazy preemption. The arch-specific entry code, thread information structure definitions, and associated data tables have to be extended to provide this support. Then the Kconfig file has to be extended to indicate the support is available, and also to indicate that support for full RT preemption is now available. Signed-off-by: Anders Roxell <anders.roxell@linaro.org> ] 245/258 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: connector/cn_proc: Protect send_msg() with a local lock on RT Date: Sun, 16 Oct 2016 05:11:54 +0200 |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:931 |in_atomic(): 1, irqs_disabled(): 0, pid: 31807, name: sleep |Preemption disabled at:[<ffffffff8148019b>] proc_exit_connector+0xbb/0x140 | |CPU: 4 PID: 31807 Comm: sleep Tainted: G W E 4.8.0-rt11-rt #106 |Call Trace: | [<ffffffff813436cd>] dump_stack+0x65/0x88 | [<ffffffff8109c425>] ___might_sleep+0xf5/0x180 | [<ffffffff816406b0>] __rt_spin_lock+0x20/0x50 | [<ffffffff81640978>] rt_read_lock+0x28/0x30 | [<ffffffff8156e209>] netlink_broadcast_filtered+0x49/0x3f0 | [<ffffffff81522621>] ? __kmalloc_reserve.isra.33+0x31/0x90 | [<ffffffff8156e5cd>] netlink_broadcast+0x1d/0x20 | [<ffffffff8147f57a>] cn_netlink_send_mult+0x19a/0x1f0 | [<ffffffff8147f5eb>] cn_netlink_send+0x1b/0x20 | [<ffffffff814801d8>] proc_exit_connector+0xf8/0x140 | [<ffffffff81077f71>] do_exit+0x5d1/0xba0 | [<ffffffff810785cc>] do_group_exit+0x4c/0xc0 | [<ffffffff81078654>] SyS_exit_group+0x14/0x20 | [<ffffffff81640a72>] entry_SYSCALL_64_fastpath+0x1a/0xa4 Since ab8ed951080e ("connector: fix out-of-order cn_proc netlink message delivery") which is v4.7-rc6. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 246/258 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drivers/block/zram: Replace bit spinlocks with rtmutex for -rt Date: Thu, 31 Mar 2016 04:08:28 +0200 They're nondeterministic, and lead to ___might_sleep() splats in -rt. OTOH, they're a lot less wasteful than an rtmutex per page. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 247/258 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drivers/zram: Don't disable preemption in zcomp_stream_get/put() Date: Thu, 20 Oct 2016 11:15:22 +0200 In v4.7, the driver switched to percpu compression streams, disabling preemption via get/put_cpu_ptr(). Use a per-zcomp_strm lock here. We also have to fix an lock order issue in zram_decompress_page() such that zs_map_object() nests inside of zcomp_stream_put() as it does in zram_bvec_write(). Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [bigeasy: get_locked_var() -> per zcomp_strm lock] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 248/258 [ Author: Mike Galbraith Email: efault@gmx.de Subject: drivers/zram: fix zcomp_stream_get() smp_processor_id() use in preemptible code Date: Wed, 23 Aug 2017 11:57:29 +0200 Use get_local_ptr() instead this_cpu_ptr() to avoid a warning regarding smp_processor_id() in preemptible code. raw_cpu_ptr() would be fine, too because the per-CPU data structure is protected with a spin lock so it does not matter much if we take the other one. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 249/258 [ Author: Haris Okanovic Email: haris.okanovic@ni.com Subject: tpm_tis: fix stall after iowrite*()s Date: Tue, 15 Aug 2017 15:13:08 -0500 ioread8() operations to TPM MMIO addresses can stall the cpu when immediately following a sequence of iowrite*()'s to the same region. For example, cyclitest measures ~400us latency spikes when a non-RT usermode application communicates with an SPI-based TPM chip (Intel Atom E3940 system, PREEMPT_RT_FULL kernel). The spikes are caused by a stalling ioread8() operation following a sequence of 30+ iowrite8()s to the same address. I believe this happens because the write sequence is buffered (in cpu or somewhere along the bus), and gets flushed on the first LOAD instruction (ioread*()) that follows. The enclosed change appears to fix this issue: read the TPM chip's access register (status code) after every iowrite*() operation to amortize the cost of flushing data to chip across multiple instructions. Signed-off-by: Haris Okanovic <haris.okanovic@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 250/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: pci/switchtec: Don't use completion's wait queue Date: Wed, 4 Oct 2017 10:24:23 +0200 The poll callback is using completion's wait_queue_head_t member and puts it in poll_wait() so the poll() caller gets a wakeup after command completed. This does not work on RT because we don't have a wait_queue_head_t in our completion implementation. Nobody in tree does like that in tree so this is the only driver that breaks. Instead of using the completion here is waitqueue with a status flag as suggested by Logan. I don't have the HW so I have no idea if it works as expected, so please test it. Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com> Cc: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 251/258 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drm,radeon,i915: Use preempt_disable/enable_rt() where recommended Date: Sat, 27 Feb 2016 08:09:11 +0100 DRM folks identified the spots, so use them. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 252/258 [ Author: Mike Galbraith Email: umgwanakikbuti@gmail.com Subject: drm,i915: Use local_lock/unlock_irq() in intel_pipe_update_start/end() Date: Sat, 27 Feb 2016 09:01:42 +0100 [ 8.014039] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:918 [ 8.014041] in_atomic(): 0, irqs_disabled(): 1, pid: 78, name: kworker/u4:4 [ 8.014045] CPU: 1 PID: 78 Comm: kworker/u4:4 Not tainted 4.1.7-rt7 #5 [ 8.014055] Workqueue: events_unbound async_run_entry_fn [ 8.014059] 0000000000000000 ffff880037153748 ffffffff815f32c9 0000000000000002 [ 8.014063] ffff88013a50e380 ffff880037153768 ffffffff815ef075 ffff8800372c06c8 [ 8.014066] ffff8800372c06c8 ffff880037153778 ffffffff8107c0b3 ffff880037153798 [ 8.014067] Call Trace: [ 8.014074] [<ffffffff815f32c9>] dump_stack+0x4a/0x61 [ 8.014078] [<ffffffff815ef075>] ___might_sleep.part.93+0xe9/0xee [ 8.014082] [<ffffffff8107c0b3>] ___might_sleep+0x53/0x80 [ 8.014086] [<ffffffff815f9064>] rt_spin_lock+0x24/0x50 [ 8.014090] [<ffffffff8109368b>] prepare_to_wait+0x2b/0xa0 [ 8.014152] [<ffffffffa016c04c>] intel_pipe_update_start+0x17c/0x300 [i915] [ 8.014156] [<ffffffff81093b40>] ? prepare_to_wait_event+0x120/0x120 [ 8.014201] [<ffffffffa0158f36>] intel_begin_crtc_commit+0x166/0x1e0 [i915] [ 8.014215] [<ffffffffa00c806d>] drm_atomic_helper_commit_planes+0x5d/0x1a0 [drm_kms_helper] [ 8.014260] [<ffffffffa0171e9b>] intel_atomic_commit+0xab/0xf0 [i915] [ 8.014288] [<ffffffffa00654c7>] drm_atomic_commit+0x37/0x60 [drm] [ 8.014298] [<ffffffffa00c6fcd>] drm_atomic_helper_plane_set_property+0x8d/0xd0 [drm_kms_helper] [ 8.014301] [<ffffffff815f77d9>] ? __ww_mutex_lock+0x39/0x40 [ 8.014319] [<ffffffffa0053b3d>] drm_mode_plane_set_obj_prop+0x2d/0x90 [drm] [ 8.014328] [<ffffffffa00c8edb>] restore_fbdev_mode+0x6b/0xf0 [drm_kms_helper] [ 8.014337] [<ffffffffa00cae49>] drm_fb_helper_restore_fbdev_mode_unlocked+0x29/0x80 [drm_kms_helper] [ 8.014346] [<ffffffffa00caec2>] drm_fb_helper_set_par+0x22/0x50 [drm_kms_helper] [ 8.014390] [<ffffffffa016dfba>] intel_fbdev_set_par+0x1a/0x60 [i915] [ 8.014394] [<ffffffff81327dc4>] fbcon_init+0x4f4/0x580 [ 8.014398] [<ffffffff8139ef4c>] visual_init+0xbc/0x120 [ 8.014401] [<ffffffff813a1623>] do_bind_con_driver+0x163/0x330 [ 8.014405] [<ffffffff813a1b2c>] do_take_over_console+0x11c/0x1c0 [ 8.014408] [<ffffffff813236e3>] do_fbcon_takeover+0x63/0xd0 [ 8.014410] [<ffffffff81328965>] fbcon_event_notify+0x785/0x8d0 [ 8.014413] [<ffffffff8107c12d>] ? __might_sleep+0x4d/0x90 [ 8.014416] [<ffffffff810775fe>] notifier_call_chain+0x4e/0x80 [ 8.014419] [<ffffffff810779cd>] __blocking_notifier_call_chain+0x4d/0x70 [ 8.014422] [<ffffffff81077a06>] blocking_notifier_call_chain+0x16/0x20 [ 8.014425] [<ffffffff8132b48b>] fb_notifier_call_chain+0x1b/0x20 [ 8.014428] [<ffffffff8132d8fa>] register_framebuffer+0x21a/0x350 [ 8.014439] [<ffffffffa00cb164>] drm_fb_helper_initial_config+0x274/0x3e0 [drm_kms_helper] [ 8.014483] [<ffffffffa016f1cb>] intel_fbdev_initial_config+0x1b/0x20 [i915] [ 8.014486] [<ffffffff8107912c>] async_run_entry_fn+0x4c/0x160 [ 8.014490] [<ffffffff81070ffa>] process_one_work+0x14a/0x470 [ 8.014493] [<ffffffff81071489>] worker_thread+0x169/0x4c0 [ 8.014496] [<ffffffff81071320>] ? process_one_work+0x470/0x470 [ 8.014499] [<ffffffff81076606>] kthread+0xc6/0xe0 [ 8.014502] [<ffffffff81070000>] ? queue_work_on+0x80/0x110 [ 8.014506] [<ffffffff81076540>] ? kthread_worker_fn+0x1c0/0x1c0 Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 253/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: cgroups: use simple wait in css_release() Date: Fri, 13 Feb 2015 15:52:24 +0100 To avoid: |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:914 |in_atomic(): 1, irqs_disabled(): 0, pid: 92, name: rcuc/11 |2 locks held by rcuc/11/92: | #0: (rcu_callback){......}, at: [<ffffffff810e037e>] rcu_cpu_kthread+0x3de/0x940 | #1: (rcu_read_lock_sched){......}, at: [<ffffffff81328390>] percpu_ref_call_confirm_rcu+0x0/0xd0 |Preemption disabled at:[<ffffffff813284e2>] percpu_ref_switch_to_atomic_rcu+0x82/0xc0 |CPU: 11 PID: 92 Comm: rcuc/11 Not tainted 3.18.7-rt0+ #1 | ffff8802398cdf80 ffff880235f0bc28 ffffffff815b3a12 0000000000000000 | 0000000000000000 ffff880235f0bc48 ffffffff8109aa16 0000000000000000 | ffff8802398cdf80 ffff880235f0bc78 ffffffff815b8dd4 000000000000df80 |Call Trace: | [<ffffffff815b3a12>] dump_stack+0x4f/0x7c | [<ffffffff8109aa16>] __might_sleep+0x116/0x190 | [<ffffffff815b8dd4>] rt_spin_lock+0x24/0x60 | [<ffffffff8108d2cd>] queue_work_on+0x6d/0x1d0 | [<ffffffff8110c881>] css_release+0x81/0x90 | [<ffffffff8132844e>] percpu_ref_call_confirm_rcu+0xbe/0xd0 | [<ffffffff813284e2>] percpu_ref_switch_to_atomic_rcu+0x82/0xc0 | [<ffffffff810e03e5>] rcu_cpu_kthread+0x445/0x940 | [<ffffffff81098a2d>] smpboot_thread_fn+0x18d/0x2d0 | [<ffffffff810948d8>] kthread+0xe8/0x100 | [<ffffffff815b9c3c>] ret_from_fork+0x7c/0xb0 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 254/258 [ Author: Mike Galbraith Email: efault@gmx.de Subject: cpuset: Convert callback_lock to raw_spinlock_t Date: Sun, 8 Jan 2017 09:32:25 +0100 The two commits below add up to a cpuset might_sleep() splat for RT: 8447a0fee974 cpuset: convert callback_mutex to a spinlock 344736f29b35 cpuset: simplify cpuset_node_allowed API BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995 in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset CPU: 135 PID: 11718 Comm: cset Tainted: G E 4.10.0-rt1-rt #4 Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014 Call Trace: ? dump_stack+0x5c/0x81 ? ___might_sleep+0xf4/0x170 ? rt_spin_lock+0x1c/0x50 ? __cpuset_node_allowed+0x66/0xc0 ? ___slab_alloc+0x390/0x570 <disables IRQs> ? anon_vma_fork+0x8f/0x140 ? copy_page_range+0x6cf/0xb00 ? anon_vma_fork+0x8f/0x140 ? __slab_alloc.isra.74+0x5a/0x81 ? anon_vma_fork+0x8f/0x140 ? kmem_cache_alloc+0x1b5/0x1f0 ? anon_vma_fork+0x8f/0x140 ? copy_process.part.35+0x1670/0x1ee0 ? _do_fork+0xdd/0x3f0 ? _do_fork+0xdd/0x3f0 ? do_syscall_64+0x61/0x170 ? entry_SYSCALL64_slow_path+0x25/0x25 The later ensured that a NUMA box WILL take callback_lock in atomic context by removing the allocator and reclaim path __GFP_HARDWALL usage which prevented such contexts from taking callback_mutex. One option would be to reinstate __GFP_HARDWALL protections for RT, however, as the 8447a0fee974 changelog states: The callback_mutex is only used to synchronize reads/updates of cpusets' flags and cpu/node masks. These operations should always proceed fast so there's no reason why we can't use a spinlock instead of the mutex. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 255/258 [ Author: Sebastian Andrzej Siewior Email: bigeasy@linutronix.de Subject: apparmor: use a locallock instead preempt_disable() Date: Wed, 11 Oct 2017 17:43:49 +0200 get_buffers() disables preemption which acts as a lock for the per-CPU variable. Since we can't disable preemption here on RT, a local_lock is lock is used in order to remain on the same CPU and not to have more than one user within the critical section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> ] 256/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: workqueue: Prevent deadlock/stall on RT Date: Fri, 27 Jun 2014 16:24:52 +0200 Austin reported a XFS deadlock/stall on RT where scheduled work gets never exececuted and tasks are waiting for each other for ever. The underlying problem is the modification of the RT code to the handling of workers which are about to go to sleep. In mainline a worker thread which goes to sleep wakes an idle worker if there is more work to do. This happens from the guts of the schedule() function. On RT this must be outside and the accessed data structures are not protected against scheduling due to the spinlock to rtmutex conversion. So the naive solution to this was to move the code outside of the scheduler and protect the data structures by the pool lock. That approach turned out to be a little naive as we cannot call into that code when the thread blocks on a lock, as it is not allowed to block on two locks in parallel. So we dont call into the worker wakeup magic when the worker is blocked on a lock, which causes the deadlock/stall observed by Austin and Mike. Looking deeper into that worker code it turns out that the only relevant data structure which needs to be protected is the list of idle workers which can be woken up. So the solution is to protect the list manipulation operations with preempt_enable/disable pairs on RT and call unconditionally into the worker code even when the worker is blocked on a lock. The preemption protection is safe as there is nothing which can fiddle with the list outside of thread context. Reported-and_tested-by: Austin Schuh <austin@peloton-tech.com> Reported-and_tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://vger.kernel.org/r/alpine.DEB.2.10.1406271249510.5170@nanos Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> ] 257/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: signals: Allow rt tasks to cache one sigqueue struct Date: Fri, 3 Jul 2009 08:44:56 -0500 To avoid allocation allow rt tasks to cache one sigqueue struct in task struct. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] 258/258 [ Author: Thomas Gleixner Email: tglx@linutronix.de Subject: Add localversion for -RT release Date: Fri, 8 Jul 2011 20:25:16 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ] Signed-off-by: Bruce Ashfield <bruce.ashfield@windriver.com>