summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2020-07-24hwmon: (acpi_power_meter) Fix potential memory leak in acpi_power_meter_add()Misono Tomohiro
commit 8b97f9922211c44a739c5cbd9502ecbb9f17f6d1 upstream. Although it rarely happens, we should call free_capabilities() if error happens after read_capabilities() to free allocated strings. Fixes: de584afa5e188 ("hwmon driver for ACPI 4.0 power meters") Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Link: https://lore.kernel.org/r/20200625043242.31175-1-misono.tomohiro@jp.fujitsu.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24hwmon: (max6697) Make sure the OVERT mask is set correctlyChu Lin
commit 016983d138cbe99a5c0aaae0103ee88f5300beb3 upstream. Per the datasheet for max6697, OVERT mask and ALERT mask are different. For example, the 7th bit of OVERT is the local channel but for alert mask, the 6th bit is the local channel. Therefore, we can't apply the same mask for both registers. In addition to that, the max6697 driver is supposed to be compatibale with different models. I manually went over all the listed chips and made sure all chip types have the same layout. Testing; mask value of 0x9 should map to 0x44 for ALERT and 0x84 for OVERT. I used iotool to read the reg value back to verify. I only tested this change on max6581. Reference: https://datasheets.maximintegrated.com/en/ds/MAX6581.pdf https://datasheets.maximintegrated.com/en/ds/MAX6697.pdf https://datasheets.maximintegrated.com/en/ds/MAX6699.pdf Signed-off-by: Chu Lin <linchuyuan@google.com> Fixes: 5372d2d71c46e ("hwmon: Driver for Maxim MAX6697 and compatibles") Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24cxgb4: fix SGE queue dump destination buffer contextRahul Lakkireddy
commit 1992ded5d111997877a9a25205976d8d03c46814 upstream. The data in destination buffer is expected to be be parsed in big endian. So, use the right context. Fixes following sparse warning: cudbg_lib.c:2041:44: warning: incorrect type in assignment (different base types) cudbg_lib.c:2041:44: expected unsigned long long [usertype] cudbg_lib.c:2041:44: got restricted __be64 [usertype] Fixes: 736c3b94474e ("cxgb4: collect egress and ingress SGE queue contexts") Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24cxgb4: use correct type for all-mask IP address comparisonRahul Lakkireddy
commit f286dd8eaad5a2758750f407ab079298e0bcc8a5 upstream. Use correct type to check for all-mask exact match IP addresses. Fixes following sparse warnings due to big endian value checks against 0xffffffff in is_addr_all_mask(): cxgb4_filter.c:977:25: warning: restricted __be32 degrades to integer cxgb4_filter.c:983:37: warning: restricted __be32 degrades to integer cxgb4_filter.c:984:37: warning: restricted __be32 degrades to integer cxgb4_filter.c:985:37: warning: restricted __be32 degrades to integer cxgb4_filter.c:986:37: warning: restricted __be32 degrades to integer Fixes: 3eb8b62d5a26 ("cxgb4: add support to create hash-filters via tc-flower offload") Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24cxgb4: fix endian conversions for L4 ports in filtersRahul Lakkireddy
commit 63b53b0b99cd5f2d9754a21eda2ed8e706646cc9 upstream. The source and destination L4 ports in filter offload need to be in CPU endian. They will finally be converted to Big Endian after all operations are done and before giving them to hardware. The L4 ports for NAT are expected to be passed as a byte stream TCB. So, treat them as such. Fixes following sparse warnings in several places: cxgb4_tc_flower.c:159:33: warning: cast from restricted __be16 cxgb4_tc_flower.c:159:33: warning: incorrect type in argument 1 (different base types) cxgb4_tc_flower.c:159:33: expected unsigned short [usertype] val cxgb4_tc_flower.c:159:33: got restricted __be16 [usertype] dst Fixes: dca4faeb812f ("cxgb4: Add LE hash collision bug fix path in LLD driver") Fixes: 62488e4b53ae ("cxgb4: add basic tc flower offload support") Fixes: 557ccbf9dfa8 ("cxgb4: add tc flower support for L3/L4 rewrite") Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24cxgb4: parse TC-U32 key values and masks nativelyRahul Lakkireddy
commit 27f78cb245abdb86735529c13b0a579f57829e71 upstream. TC-U32 passes all keys values and masks in __be32 format. The parser already expects this and hence pass the value and masks in __be32 natively to the parser. Fixes following sparse warnings in several places: cxgb4_tc_u32.c:57:21: warning: incorrect type in assignment (different base types) cxgb4_tc_u32.c:57:21: expected unsigned int [usertype] val cxgb4_tc_u32.c:57:21: got restricted __be32 [usertype] val cxgb4_tc_u32_parse.h:48:24: warning: cast to restricted __be32 Fixes: 2e8aad7bf203 ("cxgb4: add parser to translate u32 filters to internal spec") Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24cxgb4: use unaligned conversion for fetching timestampRahul Lakkireddy
commit 589b1c9c166dce120e27b32a83a78f55464a7ef9 upstream. Use get_unaligned_be64() to fetch the timestamp needed for ns_to_ktime() conversion. Fixes following sparse warning: sge.c:3282:43: warning: cast to restricted __be64 Fixes: a456950445a0 ("cxgb4: time stamping interface for PTP") Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24rxrpc: Fix afs large storage transmission performance dropDavid Howells
commit 02c28dffb13abbaaedece1e4a6493b48ad3f913a upstream. Commit 2ad6691d988c, which moved the modification of the status annotation for a packet in the Tx buffer prior to the retransmission moved the state clearance, but managed to lose the bit that set it to UNACK. Consequently, if a retransmission occurs, the packet is accidentally changed to the ACK state (ie. 0) by masking it off, which means that the packet isn't counted towards the tally of newly-ACK'd packets if it gets hard-ACK'd. This then prevents the congestion control algorithm from recovering properly. Fix by reinstating the change of state to UNACK. Spotted by the generic/460 xfstest. Fixes: 2ad6691d988c ("rxrpc: Fix race between incoming ACK parser and retransmitter") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24drm/msm/dpu: fix error return code in dpu_encoder_initChen Tao
commit aa472721c8dbe1713cf510f56ffbc56ae9e14247 upstream. Fix to return negative error code -ENOMEM with the use of ERR_PTR from dpu_encoder_init. Fixes: 25fdd5933e4c ("drm/msm: Add SDM845 DPU support") Signed-off-by: Chen Tao <chentao107@huawei.com> Signed-off-by: Rob Clark <robdclark@chromium.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24selftests: tpm: Use /bin/sh instead of /bin/bashJarkko Sakkinen
commit 377ff83083c953dd58c5a030b3c9b5b85d8cc727 upstream. It's better to use /bin/sh instead of /bin/bash in order to run the tests in the BusyBox shell. Fixes: 6ea3dfe1e073 ("selftests: add TPM 2.0 tests") Cc: stable@vger.kernel.org Cc: linux-integrity@vger.kernel.org Cc: linux-kselftest@vger.kernel.org Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24kgdb: Avoid suspicious RCU usage warningDouglas Anderson
commit 440ab9e10e2e6e5fd677473ee6f9e3af0f6904d6 upstream. At times when I'm using kgdb I see a splat on my console about suspicious RCU usage. I managed to come up with a case that could reproduce this that looked like this: WARNING: suspicious RCU usage 5.7.0-rc4+ #609 Not tainted ----------------------------- kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by swapper/0/1: #0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c #1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac #2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac stack backtrace: CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609 Hardware name: Google Cheza (rev3+) (DT) Call trace: dump_backtrace+0x0/0x1b8 show_stack+0x1c/0x24 dump_stack+0xd4/0x134 lockdep_rcu_suspicious+0xf0/0x100 find_task_by_pid_ns+0x5c/0x80 getthread+0x8c/0xb0 gdb_serial_stub+0x9d4/0xd04 kgdb_cpu_enter+0x284/0x7ac kgdb_handle_exception+0x174/0x20c kgdb_brk_fn+0x24/0x30 call_break_hook+0x6c/0x7c brk_handler+0x20/0x5c do_debug_exception+0x1c8/0x22c el1_sync_handler+0x3c/0xe4 el1_sync+0x7c/0x100 rpmh_rsc_probe+0x38/0x420 platform_drv_probe+0x94/0xb4 really_probe+0x134/0x300 driver_probe_device+0x68/0x100 __device_attach_driver+0x90/0xa8 bus_for_each_drv+0x84/0xcc __device_attach+0xb4/0x13c device_initial_probe+0x18/0x20 bus_probe_device+0x38/0x98 device_add+0x38c/0x420 If I understand properly we should just be able to blanket kgdb under one big RCU read lock and the problem should go away. We'll add it to the beast-of-a-function known as kgdb_cpu_enter(). With this I no longer get any splats and things seem to work fine. Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20200602154729.v2.1.I70e0d4fd46d5ed2aaf0c98a355e8e1b7a5bb7e4e@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24nvme-multipath: fix bogus request queue reference putSagi Grimberg
commit c31244669f57963b6ce133a5555b118fc50aec95 upstream. The mpath disk node takes a reference on the request mpath request queue when adding live path to the mpath gendisk. However if we connected to an inaccessible path device_add_disk is not called, so if we disconnect and remove the mpath gendisk we endup putting an reference on the request queue that was never taken [1]. Fix that to check if we ever added a live path (using NVME_NS_HEAD_HAS_DISK flag) and if not, clear the disk->queue reference. [1]: ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0 CPU: 1 PID: 1372 Comm: nvme Tainted: G O 5.7.0-rc2+ #3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014 RIP: 0010:refcount_warn_saturate+0xa6/0xf0 RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007 RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980 RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0 FS: 00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: disk_release+0xa2/0xc0 device_release+0x28/0x80 kobject_put+0xa5/0x1b0 nvme_put_ns_head+0x26/0x70 [nvme_core] nvme_put_ns+0x30/0x60 [nvme_core] nvme_remove_namespaces+0x9b/0xe0 [nvme_core] nvme_do_delete_ctrl+0x43/0x5c [nvme_core] nvme_sysfs_delete.cold+0x8/0xd [nvme_core] kernfs_fop_write+0xc1/0x1a0 vfs_write+0xb6/0x1a0 ksys_write+0x5f/0xe0 do_syscall_64+0x52/0x1a0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: Anton Eidelman <anton@lightbitslabs.com> Tested-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24nvme-multipath: fix deadlock due to head->lockAnton Eidelman
commit d8a22f85609fadb46ba699e0136cc3ebdeebff79 upstream. In the following scenario scan_work and ana_work will deadlock: When scan_work calls nvme_mpath_add_disk() this holds ana_lock and invokes nvme_parse_ana_log(), which may issue IO in device_add_disk() and hang waiting for an accessible path. While nvme_mpath_set_live() only called when nvme_state_is_live(), a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO. Since nvme_mpath_set_live() holds ns->head->lock, an ana_work on ANY ctrl will not be able to complete nvme_mpath_set_live() on the same ns->head, which is required in order to update the new accessible path and remove NVME_NS_ANA_PENDING.. Therefore IO never completes: deadlock [1]. Fix: Move device_add_disk out of the head->lock and protect it with an atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit. [1]: kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds. kernel: Tainted: G OE 5.3.5-050305-generic #201910071830 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: kworker/u8:2 D 0 160 2 0x80004000 kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: schedule_preempt_disabled+0xe/0x10 kernel: __mutex_lock.isra.0+0x182/0x4f0 kernel: __mutex_lock_slowpath+0x13/0x20 kernel: mutex_lock+0x2e/0x40 kernel: nvme_update_ns_ana_state+0x22/0x60 [nvme_core] kernel: nvme_update_ana_state+0xca/0xe0 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_read_ana_log+0x76/0x100 [nvme_core] kernel: nvme_ana_work+0x15/0x20 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x4d/0x400 kernel: kthread+0x104/0x140 kernel: ret_from_fork+0x35/0x40 kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds. kernel: Tainted: G OE 5.3.5-050305-generic #201910071830 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: kworker/u8:4 D 0 439 2 0x80004000 kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_mpath_add_disk+0xbe/0x100 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x256/0x390 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x4d/0x400 kernel: kthread+0x104/0x140 kernel: ret_from_fork+0x35/0x40 Fixes: 0d0b660f214d ("nvme: add ANA support") Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24nvme-multipath: fix deadlock between ana_work and scan_workAnton Eidelman
commit 489dd102a2c7c94d783a35f9412eb085b8da1aa4 upstream. When scan_work calls nvme_mpath_add_disk() this holds ana_lock and invokes nvme_parse_ana_log(), which may issue IO in device_add_disk() and hang waiting for an accessible path. While nvme_mpath_set_live() only called when nvme_state_is_live(), a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO. In order to recover and complete the IO ana_work on the same ctrl should be able to update the path state and remove NVME_NS_ANA_PENDING. The deadlock occurs because scan_work keeps holding ana_lock, so ana_work hangs [1]. Fix: Now nvme_mpath_add_disk() uses nvme_parse_ana_log() to obtain a copy of the ANA group desc, and then calls nvme_update_ns_ana_state() without holding ana_lock. [1]: kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: schedule_preempt_disabled+0xe/0x10 kernel: __mutex_lock.isra.0+0x182/0x4f0 kernel: ? __switch_to_asm+0x34/0x70 kernel: ? select_task_rq_fair+0x1aa/0x5c0 kernel: ? kvm_sched_clock_read+0x11/0x20 kernel: ? sched_clock+0x9/0x10 kernel: __mutex_lock_slowpath+0x13/0x20 kernel: mutex_lock+0x2e/0x40 kernel: nvme_read_ana_log+0x3a/0x100 [nvme_core] kernel: nvme_ana_work+0x15/0x20 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x4d/0x400 kernel: kthread+0x104/0x140 kernel: ? process_one_work+0x380/0x380 kernel: ? kthread_park+0x80/0x80 kernel: ret_from_fork+0x35/0x40 Fixes: 0d0b660f214d ("nvme: add ANA support") Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24nvme-multipath: set bdi capabilities onceKeith Busch
commit b2ce4d90690bd29ce5b554e203cd03682dd59697 upstream. The queues' backing device info capabilities don't change with each namespace revalidation. Set it only when each path's request_queue is initially added to a multipath queue. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> [PG: use v5.4.51-stable version instead of mainline.] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24s390/debug: avoid kernel warning on too large number of pagesChristian Borntraeger
commit 827c4913923e0b441ba07ba4cc41e01181102303 upstream. When specifying insanely large debug buffers a kernel warning is printed. The debug code does handle the error gracefully, though. Instead of duplicating the check let us silence the warning to avoid crashes when panic_on_warn is used. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24tools lib traceevent: Handle __attribute__((user)) in field namesSteven Rostedt (VMware)
commit 74621d929d944529a5e2878a84f48bfa6fb69a66 upstream. Commit c61f13eaa1ee1 ("gcc-plugins: Add structleak for more stack initialization") added "__attribute__((user))" to the user when stackleak detector is enabled. This now appears in the field format of system call trace events for system calls that have user buffers. The "__attribute__((user))" breaks the parsing in libtraceevent. That needs to be handled. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kees Kook <keescook@chromium.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-mm@kvack.org Cc: linux-trace-devel@vger.kernel.org Link: http://lore.kernel.org/lkml/20200324200956.663647256@goodmis.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24tools lib traceevent: Add append() function helper for appending stringsSteven Rostedt (VMware)
commit 27d4d336f2872193e90ee5450559e1699fae0f6d upstream. There's several locations that open code realloc and strcat() to append text to strings. Add an append() function that takes a delimiter and a string to append to another string. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jaewon Lim <jaewon31.kim@samsung.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kees Kook <keescook@chromium.org> Cc: linux-mm@kvack.org Cc: linux-trace-devel@vger.kernel.org Cc: Namhyung Kim <namhyung@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lore.kernel.org/lkml/20200324200956.515118403@goodmis.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24rxrpc: Fix race between incoming ACK parser and retransmitterDavid Howells
commit 2ad6691d988c0c611362ddc2aad89e0fb50e3261 upstream. There's a race between the retransmission code and the received ACK parser. The problem is that the retransmission loop has to drop the lock under which it is iterating through the transmission buffer in order to transmit a packet, but whilst the lock is dropped, the ACK parser can crank the Tx window round and discard the packets from the buffer. The retransmission code then updated the annotations for the wrong packet and a later retransmission thought it had to retransmit a packet that wasn't there, leading to a NULL pointer dereference. Fix this by: (1) Moving the annotation change to before we drop the lock prior to transmission. This means we can't vary the annotation depending on the outcome of the transmission, but that's fine - we'll retransmit again later if it failed now. (2) Skipping the packet if the skb pointer is NULL. The following oops was seen: BUG: kernel NULL pointer dereference, address: 000000000000002d Workqueue: krxrpcd rxrpc_process_call RIP: 0010:rxrpc_get_skb+0x14/0x8a ... Call Trace: rxrpc_resend+0x331/0x41e ? get_vtime_delta+0x13/0x20 rxrpc_process_call+0x3c0/0x4ac process_one_work+0x18f/0x27f worker_thread+0x1a3/0x247 ? create_worker+0x17d/0x17d kthread+0xe6/0xeb ? kthread_delayed_work_timer_fn+0x83/0x83 ret_from_fork+0x1f/0x30 Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24mm/slub: fix stack overruns with SLUB_STATSQian Cai
commit a68ee0573991e90af2f1785db309206408bad3e5 upstream. There is no need to copy SLUB_STATS items from root memcg cache to new memcg cache copies. Doing so could result in stack overruns because the store function only accepts 0 to clear the stat and returns an error for everything else while the show method would print out the whole stat. Then, the mismatch of the lengths returns from show and store methods happens in memcg_propagate_slab_attrs(): else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf)) buf = mbuf; max_attr_size is only 2 from slab_attr_store(), then, it uses mbuf[64] in show_stat() later where a bounch of sprintf() would overrun the stack variable. Fix it by always allocating a page of buffer to be used in show_stat() if SLUB_STATS=y which should only be used for debug purpose. # echo 1 > /sys/kernel/slab/fs_cache/shrink BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0 Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func Call Trace: number+0x421/0x6e0 vsnprintf+0x451/0x8e0 sprintf+0x9e/0xd0 show_stat+0x124/0x1d0 alloc_slowpath_show+0x13/0x20 __kmem_cache_create+0x47a/0x6b0 addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame: process_one_work+0x0/0xb90 this frame has 1 object: [32, 72) 'lockdep_map' Memory state around the buggy address: ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 ^ ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00 ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0 Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func Call Trace: __kmem_cache_create+0x6ac/0x6b0 Fixes: 107dab5c92d5 ("slub: slub-specific propagation changes") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Glauber Costa <glauber@scylladb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24mm/slub.c: fix corrupted freechain in deactivate_slab()Dongli Zhang
commit 52f23478081ae0dcdb95d1650ea1e7d52d586829 upstream. The slub_debug is able to fix the corrupted slab freelist/page. However, alloc_debug_processing() only checks the validity of current and next freepointer during allocation path. As a result, once some objects have their freepointers corrupted, deactivate_slab() may lead to page fault. Below is from a test kernel module when 'slub_debug=PUF,kmalloc-128 slub_nomerge'. The test kernel corrupts the freepointer of one free object on purpose. Unfortunately, deactivate_slab() does not detect it when iterating the freechain. BUG: unable to handle page fault for address: 00000000123456f8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI ... ... RIP: 0010:deactivate_slab.isra.92+0xed/0x490 ... ... Call Trace: ___slab_alloc+0x536/0x570 __slab_alloc+0x17/0x30 __kmalloc+0x1d9/0x200 ext4_htree_store_dirent+0x30/0xf0 htree_dirblock_to_tree+0xcb/0x1c0 ext4_htree_fill_tree+0x1bc/0x2d0 ext4_readdir+0x54f/0x920 iterate_dir+0x88/0x190 __x64_sys_getdents+0xa6/0x140 do_syscall_64+0x49/0x170 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Therefore, this patch adds extra consistency check in deactivate_slab(). Once an object's freepointer is corrupted, all following objects starting at this object are isolated. [akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n] Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joe Jin <joe.jin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24sched/debug: Make sd->flags sysctl read-onlyValentin Schneider
commit 9818427c6270a9ce8c52c8621026fe9cebae0f92 upstream. Writing to the sysctl of a sched_domain->flags directly updates the value of the field, and goes nowhere near update_top_cache_domain(). This means that the cached domain pointers can end up containing stale data (e.g. the domain pointed to doesn't have the relevant flag set anymore). Explicit domain walks that check for flags will be affected by the write, but this won't be in sync with the cached pointers which will still point to the domains that were cached at the last sched_domain build. In other words, writing to this interface is playing a dangerous game. It could be made to trigger an update of the cached sched_domain pointers when written to, but this does not seem to be worth the trouble. Make it read-only. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200415210512.805-3-valentin.schneider@arm.com Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24usbnet: smsc95xx: Fix use-after-free after removalTuomas Tynkkynen
commit b835a71ef64a61383c414d6bf2896d2c0161deca upstream. Syzbot reports an use-after-free in workqueue context: BUG: KASAN: use-after-free in mutex_unlock+0x19/0x40 kernel/locking/mutex.c:737 mutex_unlock+0x19/0x40 kernel/locking/mutex.c:737 __smsc95xx_mdio_read drivers/net/usb/smsc95xx.c:217 [inline] smsc95xx_mdio_read+0x583/0x870 drivers/net/usb/smsc95xx.c:278 check_carrier+0xd1/0x2e0 drivers/net/usb/smsc95xx.c:644 process_one_work+0x777/0xf90 kernel/workqueue.c:2274 worker_thread+0xa8f/0x1430 kernel/workqueue.c:2420 kthread+0x2df/0x300 kernel/kthread.c:255 It looks like that smsc95xx_unbind() is freeing the structures that are still in use by the concurrently running workqueue callback. Thus switch to using cancel_delayed_work_sync() to ensure the work callback really is no longer active. Reported-by: syzbot+29dc7d4ae19b703ff947@syzkaller.appspotmail.com Signed-off-by: Tuomas Tynkkynen <tuomas.tynkkynen@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24EDAC/amd64: Read back the scrub rate PCI register on F15hBorislav Petkov
commit ee470bb25d0dcdf126f586ec0ae6dca66cb340a4 upstream. Commit: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") added support for F15h, model 0x60 CPUs but in doing so, missed to read back SCRCTRL PCI config register on F15h CPUs which are *not* model 0x60. Add that read so that doing $ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate can show the previously set DRAM scrub rate. Fixes: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") Reported-by: Anders Andersson <pipatron@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> #v4.4.. Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-24mm: fix swap cache node allocation maskHugh Dickins
commit 243bce09c91b0145aeaedd5afba799d81841c030 upstream. Chris Murphy reports that a slightly overcommitted load, testing swap and zram along with i915, splats and keeps on splatting, when it had better fail less noisily: gnome-shell: page allocation failure: order:0, mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1 Call Trace: dump_stack+0x64/0x88 warn_alloc.cold+0x75/0xd9 __alloc_pages_slowpath.constprop.0+0xcfa/0xd30 __alloc_pages_nodemask+0x2df/0x320 alloc_slab_page+0x195/0x310 allocate_slab+0x3c5/0x440 ___slab_alloc+0x40c/0x5f0 __slab_alloc+0x1c/0x30 kmem_cache_alloc+0x20e/0x220 xas_nomem+0x28/0x70 add_to_swap_cache+0x321/0x400 __read_swap_cache_async+0x105/0x240 swap_cluster_readahead+0x22c/0x2e0 shmem_swapin+0x8e/0xc0 shmem_swapin_page+0x196/0x740 shmem_getpage_gfp+0x3a2/0xa60 shmem_read_mapping_page_gfp+0x32/0x60 shmem_get_pages+0x155/0x5e0 [i915] __i915_gem_object_get_pages+0x68/0xa0 [i915] i915_vma_pin+0x3fe/0x6c0 [i915] eb_add_vma+0x10b/0x2c0 [i915] i915_gem_do_execbuffer+0x704/0x3430 [i915] i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915] drm_ioctl_kernel+0x86/0xd0 [drm] drm_ioctl+0x206/0x390 [drm] ksys_ioctl+0x82/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5b/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported on 5.7, but it goes back really to 3.1: when shmem_read_mapping_page_gfp() was implemented for use by i915, and allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but missed swapin's "& GFP_KERNEL" mask for page tree node allocation in __read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits from what page cache uses, but GFP_RECLAIM_MASK is now what's needed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=208085 Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils Fixes: 68da9f055755 ("tmpfs: pass gfp to shmem_getpage_gfp") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Chris Murphy <lists@colorremedies.com> Analyzed-by: Vlastimil Babka <vbabka@suse.cz> Analyzed-by: Matthew Wilcox <willy@infradead.org> Tested-by: Chris Murphy <lists@colorremedies.com> Cc: <stable@vger.kernel.org> [3.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16Linux 5.2.50v5.2.50Paul Gortmaker
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16nbd: Fix memory leak in nbd_add_socketZheng Bin
commit 579dd91ab3a5446b148e7f179b6596b270dace46 upstream. When adding first socket to nbd, if nsock's allocation failed, the data structure member "config->socks" was reallocated, but the data structure member "config->num_connections" was not updated. A memory leak will occur then because the function "nbd_config_put" will free "config->socks" only when "config->num_connections" is not zero. Fixes: 03bf73c315ed ("nbd: prevent memory leak") Reported-by: syzbot+934037347002901b8d2a@syzkaller.appspotmail.com Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16nbd: fix crash when the blksize is zeroXiubo Li
commit 553768d1169a48c0cd87c4eb4ab57534ee663415 upstream. This will allow the blksize to be set zero and then use 1024 as default. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Xiubo Li <xiubli@redhat.com> [fix to use goto out instead of return in genl_connect] Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16ip6_gre: fix null-ptr-deref in ip6gre_init_net()Wei Yongjun
commit 46ef5b89ec0ecf290d74c4aee844f063933c4da4 upstream. KASAN report null-ptr-deref error when register_netdev() failed: KASAN: null-ptr-deref in range [0x00000000000003c0-0x00000000000003c7] CPU: 2 PID: 422 Comm: ip Not tainted 5.8.0-rc4+ #12 Call Trace: ip6gre_init_net+0x4ab/0x580 ? ip6gre_tunnel_uninit+0x3f0/0x3f0 ops_init+0xa8/0x3c0 setup_net+0x2de/0x7e0 ? rcu_read_lock_bh_held+0xb0/0xb0 ? ops_init+0x3c0/0x3c0 ? kasan_unpoison_shadow+0x33/0x40 ? __kasan_kmalloc.constprop.0+0xc2/0xd0 copy_net_ns+0x27d/0x530 create_new_namespaces+0x382/0xa30 unshare_nsproxy_namespaces+0xa1/0x1d0 ksys_unshare+0x39c/0x780 ? walk_process_tree+0x2a0/0x2a0 ? trace_hardirqs_on+0x4a/0x1b0 ? _raw_spin_unlock_irq+0x1f/0x30 ? syscall_trace_enter+0x1a7/0x330 ? do_syscall_64+0x1c/0xa0 __x64_sys_unshare+0x2d/0x40 do_syscall_64+0x56/0xa0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ip6gre_tunnel_uninit() has set 'ign->fb_tunnel_dev' to NULL, later access to ign->fb_tunnel_dev cause null-ptr-deref. Fix it by saving 'ign->fb_tunnel_dev' to local variable ndev. Fixes: dafabb6590cb ("ip6_gre: fix use-after-free in ip6gre_tunnel_lookup()") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16Revert "tty: hvc: Fix data abort due to race in hvc_open"Greg Kroah-Hartman
commit cf9c94456ebafc6d75a834e58dfdc8ae71a3acbc upstream. This reverts commit e2bd1dcbe1aa34ff5570b3427c530e4332ecf0fe. In discussion on the mailing list, it has been determined that this is not the correct type of fix for this issue. Revert it so that we can do this correctly. Reported-by: Jiri Slaby <jslaby@suse.cz> Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20200428032601.22127-1-rananta@codeaurora.org Cc: Raghavendra Rao Ananta <rananta@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16dm writecache: add cond_resched to loop in persistent_memory_claim()Mikulas Patocka
commit d35bd764e6899a7bea71958f08d16cea5bfa1919 upstream. Add cond_resched() to a loop that fills in the mapper memory area because the loop can be executed many times. Fixes: 48debafe4f2fe ("dm: add writecache target") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16dm writecache: correct uncommitted_block when discarding uncommitted entryHuaisheng Ye
commit 39495b12ef1cf602e6abd350dce2ef4199906531 upstream. When uncommitted entry has been discarded, correct wc->uncommitted_block for getting the exact number. Fixes: 48debafe4f2fe ("dm: add writecache target") Cc: stable@vger.kernel.org Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16xprtrdma: Fix handling of RDMA_ERROR repliesChuck Lever
commit 7b2182ec381f8ea15c7eb1266d6b5d7da620ad93 upstream. The RPC client currently doesn't handle ERR_CHUNK replies correctly. rpcrdma_complete_rqst() incorrectly passes a negative number to xprt_complete_rqst() as the number of bytes copied. Instead, set task->tk_status to the error value, and return zero bytes copied. In these cases, return -EIO rather than -EREMOTEIO. The RPC client's finite state machine doesn't know what to do with -EREMOTEIO. Additional clean ups: - Don't double-count RDMA_ERROR replies - Remove a stale comment Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@kernel.vger.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16NFSv4 fix CLOSE not waiting for direct IO compeletionOlga Kornievskaia
commit d03727b248d0dae6199569a8d7b629a681154633 upstream. Figuring out the root case for the REMOVE/CLOSE race and suggesting the solution was done by Neil Brown. Currently what happens is that direct IO calls hold a reference on the open context which is decremented as an asynchronous task in the nfs_direct_complete(). Before reference is decremented, control is returned to the application which is free to close the file. When close is being processed, it decrements its reference on the open_context but since directIO still holds one, it doesn't sent a close on the wire. It returns control to the application which is free to do other operations. For instance, it can delete a file. Direct IO is finally releasing its reference and triggering an asynchronous close. Which races with the REMOVE. On the server, REMOVE can be processed before the CLOSE, failing the REMOVE with EACCES as the file is still opened. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Suggested-by: Neil Brown <neilb@suse.com> CC: stable@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16pNFS/flexfiles: Fix list corruption if the mirror count changesTrond Myklebust
commit 8b04013737341442ed914b336cde866b902664ae upstream. If the mirror count changes in the new layout we pick up inside ff_layout_pg_init_write(), then we can end up adding the request to the wrong mirror and corrupting the mirror->pg_list. Fixes: d600ad1f2bdb ("NFS41: pop some layoutget errors to application") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()Chuck Lever
commit 89a3c9f5b9f0bcaa9aea3e8b2a616fcaea9aad78 upstream. @subbuf is an output parameter of xdr_buf_subsegment(). A survey of call sites shows that @subbuf is always uninitialized before xdr_buf_segment() is invoked by callers. There are some execution paths through xdr_buf_subsegment() that do not set all of the fields in @subbuf, leaving some pointer fields containing garbage addresses. Subsequent processing of that buffer then results in a page fault. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16sunrpc: fixed rollback in rpc_gssd_dummy_populate()Vasily Averin
commit b7ade38165ca0001c5a3bd5314a314abbbfbb1b7 upstream. __rpc_depopulate(gssd_dentry) was lost on error path cc: stable@vger.kernel.org Fixes: commit 4b9a445e3eeb ("sunrpc: create a new dummy pipe for gssd to hold open") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16Staging: rtl8723bs: prevent buffer overflow in update_sta_support_rate()Dan Carpenter
commit b65a2d8c8614386f7e8d38ea150749f8a862f431 upstream. The "ie_len" variable is in the 0-255 range and it comes from the network. If it's over NDIS_802_11_LENGTH_RATES_EX (16) then that will lead to memory corruption. Fixes: 554c0a3abf21 ("staging: Add rtl8723bs sdio wifi driver") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20200603101958.GA1845750@mwanda Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16drm/radeon: fix fb_div check in ni_init_smc_spll_table()Denis Efremov
commit 35f760b44b1b9cb16a306bdcc7220fbbf78c4789 upstream. clk_s is checked twice in a row in ni_init_smc_spll_table(). fb_div should be checked instead. Fixes: 69e0b57a91ad ("drm/radeon/kms: add dpm support for cayman (v5)") Cc: stable@vger.kernel.org Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16drm: rcar-du: Fix build errorDaniel Gomez
commit 5f9af404eec82981c4345c9943be48422234e7ab upstream. Select DRM_KMS_HELPER dependency. Build error when DRM_KMS_HELPER is not selected: drivers/gpu/drm/rcar-du/rcar_lvds.o:(.rodata+0xd48): undefined reference to `drm_atomic_helper_bridge_duplicate_state' drivers/gpu/drm/rcar-du/rcar_lvds.o:(.rodata+0xd50): undefined reference to `drm_atomic_helper_bridge_destroy_state' drivers/gpu/drm/rcar-du/rcar_lvds.o:(.rodata+0xd70): undefined reference to `drm_atomic_helper_bridge_reset' drivers/gpu/drm/rcar-du/rcar_lvds.o:(.rodata+0xdc8): undefined reference to `drm_atomic_helper_connector_reset' drivers/gpu/drm/rcar-du/rcar_lvds.o:(.rodata+0xde0): undefined reference to `drm_helper_probe_single_connector_modes' drivers/gpu/drm/rcar-du/rcar_lvds.o:(.rodata+0xe08): undefined reference to `drm_atomic_helper_connector_duplicate_state' drivers/gpu/drm/rcar-du/rcar_lvds.o:(.rodata+0xe10): undefined reference to `drm_atomic_helper_connector_destroy_state' Fixes: c6a27fa41fab ("drm: rcar-du: Convert LVDS encoder code to bridge driver") Cc: <stable@vger.kernel.org> Signed-off-by: Daniel Gomez <dagmcr@gmail.com> Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com> Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16ring-buffer: Zero out time extend if it is nested and not absoluteSteven Rostedt (VMware)
commit 097350d1c6e1f5808cae142006f18a0bbc57018d upstream. Currently the ring buffer makes events that happen in interrupts that preempt another event have a delta of zero. (Hopefully we can change this soon). But this is to deal with the races of updating a global counter with lockless and nesting functions updating deltas. With the addition of absolute time stamps, the time extend didn't follow this rule. A time extend can happen if two events happen longer than 2^27 nanoseconds appart, as the delta time field in each event is only 27 bits. If that happens, then a time extend is injected with 2^59 bits of nanoseconds to use (18 years). But if the 2^27 nanoseconds happen between two events, and as it is writing the event, an interrupt triggers, it will see the 2^27 difference as well and inject a time extend of its own. But a recent change made the time extend logic not take into account the nesting, and this can cause two time extend deltas to happen moving the time stamp much further ahead than the current time. This gets all reset when the ring buffer moves to the next page, but that can cause time to appear to go backwards. This was observed in a trace-cmd recording, and since the data is saved in a file, with trace-cmd report --debug, it was possible to see that this indeed did happen! bash-52501 110d... 81778.908247: sched_switch: bash:52501 [120] S ==> swapper/110:0 [120] [12770284:0x2e8:64] <idle>-0 110d... 81778.908757: sched_switch: swapper/110:0 [120] R ==> bash:52501 [120] [509947:0x32c:64] TIME EXTEND: delta:306454770 length:0 bash-52501 110.... 81779.215212: sched_swap_numa: src_pid=52501 src_tgid=52388 src_ngid=52501 src_cpu=110 src_nid=2 dst_pid=52509 dst_tgid=52388 dst_ngid=52501 dst_cpu=49 dst_nid=1 [0:0x378:48] TIME EXTEND: delta:306458165 length:0 bash-52501 110dNh. 81779.521670: sched_wakeup: migration/110:565 [0] success=1 CPU:110 [0:0x3b4:40] and at the next page, caused the time to go backwards: bash-52504 110d... 81779.685411: sched_switch: bash:52504 [120] S ==> swapper/110:0 [120] [8347057:0xfb4:64] CPU:110 [SUBBUFFER START] [81779379165886:0x1320000] <idle>-0 110dN.. 81779.379166: sched_wakeup: bash:52504 [120] success=1 CPU:110 [0:0x10:40] <idle>-0 110d... 81779.379167: sched_switch: swapper/110:0 [120] R ==> bash:52504 [120] [1168:0x3c:64] Link: https://lkml.kernel.org/r/20200622151815.345d1bf5@oasis.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: dc4e2801d400b ("ring-buffer: Redefine the unimplemented RINGBUF_TYPE_TIME_STAMP") Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16tracing: Fix event trigger to accept redundant spacesMasami Hiramatsu
commit 6784beada631800f2c5afd567e5628c843362cee upstream. Fix the event trigger to accept redundant spaces in the trigger input. For example, these return -EINVAL echo " traceon" > events/ftrace/print/trigger echo "traceon if common_pid == 0" > events/ftrace/print/trigger echo "disable_event:kmem:kmalloc " > events/ftrace/print/trigger But these are hard to find what is wrong. To fix this issue, use skip_spaces() to remove spaces in front of actual tokens, and set NULL if there is no token. Link: http://lkml.kernel.org/r/159262476352.185015.5261566783045364186.stgit@devnote2 Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: 85f2b08268c0 ("tracing: Add basic event trigger framework") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16arm64: perf: Report the PC value in REGS_ABI_32 modeJiping Ma
commit 8dfe804a4031ca6ba3a3efb2048534249b64f3a5 upstream. A 32-bit perf querying the registers of a compat task using REGS_ABI_32 will receive zeroes from w15, when it expects to find the PC. Return the PC value for register dwarf register 15 when returning register values for a compat task to perf. Cc: <stable@vger.kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Jiping Ma <jiping.ma2@windriver.com> Link: https://lore.kernel.org/r/1589165527-188401-1-git-send-email-jiping.ma2@windriver.com [will: Shuffled code and added a comment] Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16ocfs2: fix panic on nfs server over ocfs2Junxiao Bi
commit e5a15e17a78d58f933d17cafedfcf7486a29f5b4 upstream. The following kernel panic was captured when running nfs server over ocfs2, at that time ocfs2_test_inode_bit() was checking whether one inode locating at "blkno" 5 was valid, that is ocfs2 root inode, its "suballoc_slot" was OCFS2_INVALID_SLOT(65535) and it was allocted from //global_inode_alloc, but here it wrongly assumed that it was got from per slot inode alloctor which would cause array overflow and trigger kernel panic. BUG: unable to handle kernel paging request at 0000000000001088 IP: [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0 PGD 1e06ba067 PUD 1e9e7d067 PMD 0 Oops: 0002 [#1] SMP CPU: 6 PID: 24873 Comm: nfsd Not tainted 4.1.12-124.36.1.el6uek.x86_64 #2 Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.87 02/02/2018 RIP: _raw_spin_lock+0x18/0xf0 RSP: e02b:ffff88005ae97908 EFLAGS: 00010206 RAX: ffff88005ae98000 RBX: 0000000000001088 RCX: 0000000000000000 RDX: 0000000000020000 RSI: 0000000000000009 RDI: 0000000000001088 RBP: ffff88005ae97928 R08: 0000000000000000 R09: ffff880212878e00 R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000001088 R13: ffff8800063c0aa8 R14: ffff8800650c27d0 R15: 000000000000ffff FS: 0000000000000000(0000) GS:ffff880218180000(0000) knlGS:ffff880218180000 CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000001088 CR3: 00000002033d0000 CR4: 0000000000042660 Call Trace: igrab+0x1e/0x60 ocfs2_get_system_file_inode+0x63/0x3a0 [ocfs2] ocfs2_test_inode_bit+0x328/0xa00 [ocfs2] ocfs2_get_parent+0xba/0x3e0 [ocfs2] reconnect_path+0xb5/0x300 exportfs_decode_fh+0xf6/0x2b0 fh_verify+0x350/0x660 [nfsd] nfsd4_putfh+0x4d/0x60 [nfsd] nfsd4_proc_compound+0x3d3/0x6f0 [nfsd] nfsd_dispatch+0xe0/0x290 [nfsd] svc_process_common+0x412/0x6a0 [sunrpc] svc_process+0x123/0x210 [sunrpc] nfsd+0xff/0x170 [nfsd] kthread+0xcb/0xf0 ret_from_fork+0x61/0x90 Code: 83 c2 02 0f b7 f2 e8 18 dc 91 ff 66 90 eb bf 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 48 89 fb ba 00 00 02 00 <f0> 0f c1 17 89 d0 45 31 e4 45 31 ed c1 e8 10 66 39 d0 41 89 c6 RIP _raw_spin_lock+0x18/0xf0 CR2: 0000000000001088 ---[ end trace 7264463cd1aac8f9 ]--- Kernel panic - not syncing: Fatal exception Link: http://lkml.kernel.org/r/20200616183829.87211-4-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16ocfs2: fix value of OCFS2_INVALID_SLOTJunxiao Bi
commit 9277f8334ffc719fe922d776444d6e4e884dbf30 upstream. In the ocfs2 disk layout, slot number is 16 bits, but in ocfs2 implementation, slot number is 32 bits. Usually this will not cause any issue, because slot number is converted from u16 to u32, but OCFS2_INVALID_SLOT was defined as -1, when an invalid slot number from disk was obtained, its value was (u16)-1, and it was converted to u32. Then the following checking in get_local_system_inode will be always skipped: static struct inode **get_local_system_inode(struct ocfs2_super *osb, int type, u32 slot) { BUG_ON(slot == OCFS2_INVALID_SLOT); ... } Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16ocfs2: load global_inode_allocJunxiao Bi
commit 7569d3c754e452769a5747eeeba488179e38a5da upstream. Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, that will make it load during mount. It can be used to test whether some global/system inodes are valid. One use case is that nfsd will test whether root inode is valid. Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16ocfs2: avoid inode removal while nfsd is accessing itJunxiao Bi
commit 4cd9973f9ff69e37dd0ba2bd6e6423f8179c329a upstream. Patch series "ocfs2: fix nfsd over ocfs2 issues", v2. This is a series of patches to fix issues on nfsd over ocfs2. patch 1 is to avoid inode removed while nfsd access it patch 2 & 3 is to fix a panic issue. This patch (of 4): When nfsd is getting file dentry using handle or parent dentry of some dentry, one cluster lock is used to avoid inode removed from other node, but it still could be removed from local node, so use a rw lock to avoid this. Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16mm/slab: use memzero_explicit() in kzfree()Waiman Long
commit 8982ae527fbef170ef298650c15d55a9ccd33973 upstream. The kzfree() function is normally used to clear some sensitive information, like encryption keys, in the buffer before freeing it back to the pool. Memset() is currently used for buffer clearing. However unlikely, there is still a non-zero probability that the compiler may choose to optimize away the memory clearing especially if LTO is being used in the future. To make sure that this optimization will never happen, memzero_explicit(), which is introduced in v3.18, is now used in kzfree() to future-proof it. Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Howells <dhowells@redhat.com> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Jason A . Donenfeld" <Jason@zx2c4.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eofFilipe Manana
commit 4b1946284dd6641afdb9457101056d9e6ee6204c upstream. If we attempt to write to prealloc extent located after eof using a RWF_NOWAIT write, we always fail with -EAGAIN. We do actually check if we have an allocated extent for the write at the start of btrfs_file_write_iter() through a call to check_can_nocow(), but later when we go into the actual direct IO write path we simply return -EAGAIN if the write starts at or beyond EOF. Trivial to reproduce: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/foo $ chattr +C /mnt/foo $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec) $ xfs_io -c "falloc -k 64K 1M" /mnt/foo $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo pwrite: Resource temporarily unavailable On xfs and ext4 the write succeeds, as expected. Fix this by removing the wrong check at btrfs_direct_IO(). Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-07-16btrfs: fix data block group relocation failure due to concurrent scrubFilipe Manana
commit 432cd2a10f1c10cead91fe706ff5dc52f06d642a upstream. When running relocation of a data block group while scrub is running in parallel, it is possible that the relocation will fail and abort the current transaction with an -EINVAL error: [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents [134243.999871] ------------[ cut here ]------------ [134244.000741] BTRFS: Transaction aborted (error -22) [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.017151] Code: 48 c7 c7 (...) [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286 [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000 [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001 [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001 [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08 [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000 [134244.028024] FS: 00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000 [134244.029491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0 [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [134244.034484] Call Trace: [134244.034984] btrfs_cow_block+0x12b/0x2b0 [btrfs] [134244.035859] do_relocation+0x30b/0x790 [btrfs] [134244.036681] ? do_raw_spin_unlock+0x49/0xc0 [134244.037460] ? _raw_spin_unlock+0x29/0x40 [134244.038235] relocate_tree_blocks+0x37b/0x730 [btrfs] [134244.039245] relocate_block_group+0x388/0x770 [btrfs] [134244.040228] btrfs_relocate_block_group+0x161/0x2e0 [btrfs] [134244.041323] btrfs_relocate_chunk+0x36/0x110 [btrfs] [134244.041345] btrfs_balance+0xc06/0x1860 [btrfs] [134244.043382] ? btrfs_ioctl_balance+0x27c/0x310 [btrfs] [134244.045586] btrfs_ioctl_balance+0x1ed/0x310 [btrfs] [134244.045611] btrfs_ioctl+0x1880/0x3760 [btrfs] [134244.049043] ? do_raw_spin_unlock+0x49/0xc0 [134244.049838] ? _raw_spin_unlock+0x29/0x40 [134244.050587] ? __handle_mm_fault+0x11b3/0x14b0 [134244.051417] ? ksys_ioctl+0x92/0xb0 [134244.052070] ksys_ioctl+0x92/0xb0 [134244.052701] ? trace_hardirqs_off_thunk+0x1a/0x1c [134244.053511] __x64_sys_ioctl+0x16/0x20 [134244.054206] do_syscall_64+0x5c/0x280 [134244.054891] entry_SYSCALL_64_after_hwframe+0x49/0xbe [134244.055819] RIP: 0033:0x7f29b51c9dd7 [134244.056491] Code: 00 00 00 (...) [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7 [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003 [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000 [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0 [134244.067626] irq event stamp: 0 [134244.068202] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.070909] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0 [134244.073432] ---[ end trace bd7c03622e0b0a99 ]--- The -EINVAL error comes from the following chain of function calls: __btrfs_cow_block() <-- aborts the transaction btrfs_reloc_cow_block() replace_file_extents() get_new_location() <-- returns -EINVAL When relocating a data block group, for each allocated extent of the block group, we preallocate another extent (at prealloc_file_extent_cluster()), associated with the data relocation inode, and then dirty all its pages. These preallocated extents have, and must have, the same size that extents from the data block group being relocated have. Later before we start the relocation stage that updates pointers (bytenr field of file extent items) to point to the the new extents, we trigger writeback for the data relocation inode. The expectation is that writeback will write the pages to the previously preallocated extents, that it follows the NOCOW path. That is generally the case, however, if a scrub is running it may have turned the block group that contains those extents into RO mode, in which case writeback falls back to the COW path. However in the COW path instead of allocating exactly one extent with the expected size, the allocator may end up allocating several smaller extents due to free space fragmentation - because we tell it at cow_file_range() that the minimum allocation size can match the filesystem's sector size. This later breaks the relocation's expectation that an extent associated to a file extent item in the data relocation inode has the same size as the respective extent pointed by a file extent item in another tree - in this case the extent to which the relocation inode poins to is smaller, causing relocation.c:get_new_location() to return -EINVAL. For example, if we are relocating a data block group X that has a logical address of X and the block group has an extent allocated at the logical address X + 128KiB with a size of 64KiB: 1) At prealloc_file_extent_cluster() we allocate an extent for the data relocation inode with a size of 64KiB and associate it to the file offset 128KiB (X + 128KiB - X) of the data relocation inode. This preallocated extent was allocated at block group Z; 2) A scrub running in parallel turns block group Z into RO mode and starts scrubing its extents; 3) Relocation triggers writeback for the data relocation inode; 4) When running delalloc (btrfs_run_delalloc_range()), we try first the NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC set in its flags. However, because block group Z is in RO mode, the NOCOW path (run_delalloc_nocow()) falls back into the COW path, by calling cow_file_range(); 5) At cow_file_range(), in the first iteration of the while loop we call btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum allocation size of 4KiB (fs_info->sectorsize). Due to free space fragmentation, btrfs_reserve_extent() ends up allocating two extents of 32KiB each, each one on a different iteration of that while loop; 6) Writeback of the data relocation inode completes; 7) Relocation proceeds and ends up at relocation.c:replace_file_extents(), with a leaf which has a file extent item that points to the data extent from block group X, that has a logical address (bytenr) of X + 128KiB and a size of 64KiB. Then it calls get_new_location(), which does a lookup in the data relocation tree for a file extent item starting at offset 128KiB (X + 128KiB - X) and belonging to the data relocation inode. It finds a corresponding file extent item, however that item points to an extent that has a size of 32KiB, which doesn't match the expected size of 64KiB, resuling in -EINVAL being returned from this function and propagated up to __btrfs_cow_block(), which aborts the current transaction. To fix this make sure that at cow_file_range() when we call the allocator we pass it a minimum allocation size corresponding the desired extent size if the inode belongs to the data relocation tree, otherwise pass it the filesystem's sector size as the minimum allocation size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>