summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2018-03-31sysfs: symlink: export sysfs_create_link_nowarn()Grygorii Strashko
[ Upstream commit 2399ac42e762ab25c58420e25359b2921afdc55f ] The sysfs_create_link_nowarn() is going to be used in phylib framework in subsequent patch which can be built as module. Hence, export sysfs_create_link_nowarn() to avoid build errors. Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Fixes: a3995460491d ("net: phy: Relax error checking on sysfs_create_link()") Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28staging: ncpfs: memory corruption in ncp_read_kernel()Dan Carpenter
commit 4c41aa24baa4ed338241d05494f2c595c885af8f upstream. If the server is malicious then *bytes_read could be larger than the size of the "target" buffer. It would lead to memory corruption when we do the memcpy(). Reported-by: Dr Silvio Cesare of InfoSect <Silvio Cesare <silvio.cesare@gmail.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28hugetlbfs: check for pgoff value overflowMike Kravetz
commit 63489f8e821144000e0bdca7e65a8d1cc23a7ee7 upstream. A vma with vm_pgoff large enough to overflow a loff_t type when converted to a byte offset can be passed via the remap_file_pages system call. The hugetlbfs mmap routine uses the byte offset to calculate reservations and file size. A sequence such as: mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0); remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0); will result in the following when task exits/file closed, kernel BUG at mm/hugetlb.c:749! Call Trace: hugetlbfs_evict_inode+0x2f/0x40 evict+0xcb/0x190 __dentry_kill+0xcb/0x150 __fput+0x164/0x1e0 task_work_run+0x84/0xa0 exit_to_usermode_loop+0x7d/0x80 do_syscall_64+0x18b/0x190 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 The overflowed pgoff value causes hugetlbfs to try to set up a mapping with a negative range (end < start) that leaves invalid state which causes the BUG. The previous overflow fix to this code was incomplete and did not take the remap_file_pages system call into account. [mike.kravetz@oracle.com: v3] Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com [akpm@linux-foundation.org: include mmdebug.h] [akpm@linux-foundation.org: fix -ve left shift count on sh] Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Nic Losby <blurbdust@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28nfsd: remove blocked locks on client teardownJeff Layton
commit 68ef3bc3166468678d5e1fdd216628c35bd1186f upstream. We had some reports of panics in nfsd4_lm_notify, and that showed a nfs4_lockowner that had outlived its so_client. Ensure that we walk any leftover lockowners after tearing down all of the stateids, and remove any blocked locks that they hold. With this change, we also don't need to walk the nbl_lru on nfsd_net shutdown, as that will happen naturally when we tear down the clients. Fixes: 76d348fadff5 (nfsd: have nfsd4_lock use blocking locks for v4.1+ locks) Reported-by: Frank Sorenson <fsorenso@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org # 4.9 Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24nfsd4: permit layoutget of executable-only filesBenjamin Coddington
[ Upstream commit 66282ec1cf004c09083c29cb5e49019037937bbd ] Clients must be able to read a file in order to execute it, and for pNFS that means the client needs to be able to perform a LAYOUTGET on the file. This behavior for executable-only files was added for OPEN in commit a043226bc140 "nfsd4: permit read opens of executable-only files". This fixes up xfstests generic/126 on block/scsi layouts. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21btrfs: Fix memory barriers usage with device stats countersNikolay Borisov
commit 9deae9689231964972a94bb56a79b669f9d47ac1 upstream. Commit addc3fa74e5b ("Btrfs: Fix the problem that the dirty flag of dev stats is cleared") reworked the way device stats changes are tracked. A new atomic dev_stats_ccnt counter was introduced which is incremented every time any of the device stats counters are changed. This serves as a flag whether there are any pending stats changes. However, this patch only partially implemented the correct memory barriers necessary: - It only ordered the stores to the counters but not the reads e.g. btrfs_run_dev_stats - It completely omitted any comments documenting the intended design and how the memory barriers pair with each-other This patch provides the necessary comments as well as adds a missing smp_rmb in btrfs_run_dev_stats. Furthermore since dev_stats_cnt is only a snapshot at best there was no point in reading the counter twice - once in btrfs_dev_stats_dirty and then again when assigning stats_cnt. Just collapse both reads into 1. Fixes: addc3fa74e5b ("Btrfs: Fix the problem that the dirty flag of dev stats is cleared") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodesZygo Blaxell
commit c8195a7b1ad5648857ce20ba24f384faed8512bc upstream. Until v4.14, this warning was very infrequent: WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0 Modules linked in: [...] CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1 Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014 Call Trace: dump_stack+0x85/0xc2 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 find_parent_nodes+0xc41/0x14e0 __btrfs_find_all_roots+0xad/0x120 ? extent_same_check_offsets+0x70/0x70 iterate_extent_inodes+0x168/0x300 iterate_inodes_from_logical+0x87/0xb0 ? iterate_inodes_from_logical+0x87/0xb0 ? extent_same_check_offsets+0x70/0x70 btrfs_ioctl+0x8ac/0x2820 ? lock_acquire+0xc2/0x200 do_vfs_ioctl+0x91/0x700 ? __fget+0x112/0x200 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc6 ? trace_hardirqs_off_caller+0x1f/0x140 Starting with v4.14 (specifically 86d5f9944252 ("btrfs: convert prelimary reference tracking to use rbtrees")) the WARN_ON occurs three orders of magnitude more frequently--almost once per second while running workloads like bees. Replace the WARN_ON() with a comment rationale for its removal. The rationale is paraphrased from an explanation by Edmund Nadolski <enadolski@suse.de> on the linux-btrfs mailing list. Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()") Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21btrfs: Fix use-after-free when cleaning up fs_devs with a single stale deviceNikolay Borisov
commit fd649f10c3d21ee9d7542c609f29978bdf73ab94 upstream. Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced btrfs_free_stale_device which iterates the device lists for all registered btrfs filesystems and deletes those devices which aren't mounted. In a btrfs_devices structure has only 1 device attached to it and it is unused then btrfs_free_stale_devices will proceed to also free the btrfs_fs_devices struct itself. Currently this leads to a use after free since list_for_each_entry will try to perform a check on the already freed memory to see if it has to terminate the loop. The fix is to use 'break' when we know we are freeing the current fs_devs. Fixes: 4fde46f0cc71 ("Btrfs: free the stale device") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21btrfs: alloc_chunk: fix DUP stripe size handlingHans van Kranenburg
commit 92e222df7b8f05c565009c7383321b593eca488b upstream. In case of using DUP, we search for enough unallocated disk space on a device to hold two stripes. The devices_info[ndevs-1].max_avail that holds the amount of unallocated space found is directly assigned to stripe_size, while it's actually twice the stripe size. Later on in the code, an unconditional division of stripe_size by dev_stripes corrects the value, but in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size. The unconditional division later tries to correct stripe_size, but will actually make sure we can't allocate more than half the max_chunk_size. Fix this by moving the division by dev_stripes before the max chunk size check, so it always contains the right value, instead of putting a duct tape division in further on to get it fixed again. Since in all other cases than DUP, dev_stripes is 1, this change only affects DUP. Other attempts in the past were made to fix this: * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried to fix the same problem, but still resulted in part of the code acting on a wrongly doubled stripe_size value. * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally broke this fix again. The real problem was already introduced with the rest of the code in 73c5de0051. The user visible result however will be that the max chunk size for DUP will suddenly double, while it's actually acting according to the limits in the code again like it was 5 years ago. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation") Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6") Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21btrfs: add missing initialization in btrfs_check_sharedEdmund Nadolski
commit 18bf591ba9753e3e5ba91f38f756a800693408f4 upstream. This patch addresses an issue that causes fiemap to falsely report a shared extent. The test case is as follows: xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5 sync xfs_io -c "fiemap -v" /media/scratch/file5 which gives the resulting output: wrote 65536/65536 bytes at offset 0 64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec) /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x2001 /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 This is because btrfs_check_shared calls find_parent_nodes repeatedly in a loop, passing a share_check struct to report the count of shared extent. But btrfs_check_shared does not re-initialize the count value to zero for subsequent calls from the loop, resulting in a false share count value. This is a regressive behavior from 4.13. With proper re-initialization the test result is as follows: wrote 65536/65536 bytes at offset 0 64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec) /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 which corrects the regression. Fixes: 3ec4d3238ab ("btrfs: allow backref search checks for shared extents") Signed-off-by: Edmund Nadolski <enadolski@suse.com> [ add text from cover letter to changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21btrfs: Fix NULL pointer exception in find_bio_stripeDmitriy Gorokh
commit 047fdea6341966a0898e3b16c51f54d4f5ba030a upstream. On detaching of a disk which is a part of a RAID6 filesystem, the following kernel OOPS may happen: [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0 [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0 [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0 [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 [63122.971202] Oops: 0000 [#1] SMP [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8 [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] [63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000 [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs] [63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287 [63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690 [63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000 [63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600 [63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500 [63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004 [63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000 [63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0 [63123.009969] Call Trace: [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] [63123.010251] bio_endio+0xa1/0x120 [63123.010378] generic_make_request+0x218/0x270 [63123.010921] submit_bio+0x66/0x130 [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] [63123.011245] full_stripe_write+0x96/0xc0 [btrfs] [63123.011428] raid56_parity_write+0x117/0x170 [btrfs] [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] [63123.011759] ? ___cache_free+0x1c5/0x300 [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] [63123.012257] normal_work_helper+0x19e/0x300 [btrfs] [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] [63123.012656] process_one_work+0x14d/0x350 [63123.012888] worker_thread+0x4d/0x3a0 [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 [63123.013192] kthread+0x109/0x140 [63123.013315] ? process_scheduled_works+0x40/0x40 [63123.013472] ? kthread_stop+0x110/0x110 [63123.013610] ret_from_fork+0x25/0x30 [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8 [63123.014678] CR2: 0000000000000080 [63123.016590] ---[ end trace a295ea7259c17880 ]ā€” This is reproducible in a cycle, where a series of writes is followed by SCSI device delete command. The test may take up to few minutes. Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index") [ no signed-off-by provided ] Author: Dmitriy Gorokh <Dmitriy.Gorokh@wdc.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21fs/aio: Use RCU accessors for kioctx_table->table[]Tejun Heo
commit d0264c01e7587001a8c4608a5d1818dba9a4c11a upstream. While converting ioctx index from a list to a table, db446a08c23d ("aio: convert the ioctx list to table lookup v3") missed tagging kioctx_table->table[] as an array of RCU pointers and using the appropriate RCU accessors. This introduces a small window in the lookup path where init and access may race. Mark kioctx_table->table[] with __rcu and use the approriate RCU accessors when using the field. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3") Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.12+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21fs/aio: Add explicit RCU grace period when freeing kioctxTejun Heo
commit a6d7cff472eea87d96899a20fa718d2bab7109f3 upstream. While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat") incorrectly removed explicit RCU grace period before freeing kioctx. The intention seems to be depending on the internal RCU grace periods of percpu_ref; however, percpu_ref uses a different flavor of RCU, sched-RCU. This can lead to kioctx being freed while RCU read protected dereferences are still in progress. Fix it by updating free_ioctx() to go through call_rcu() explicitly. v2: Comment added to explain double bouncing. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Fixes: e34ecee2ae79 ("aio: Fix a trinity splat") Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21lock_parent() needs to recheck if dentry got __dentry_kill'ed under itAl Viro
commit 3b821409632ab778d46e807516b457dfa72736ed upstream. In case when dentry passed to lock_parent() is protected from freeing only by the fact that it's on a shrink list and trylock of parent fails, we could get hit by __dentry_kill() (and subsequent dentry_kill(parent)) between unlocking dentry and locking presumed parent. We need to recheck that dentry is alive once we lock both it and parent *and* postpone rcu_read_unlock() until after that point. Otherwise we could return a pointer to struct dentry that already is rcu-scheduled for freeing, with ->d_lock held on it; caller's subsequent attempt to unlock it can end up with memory corruption. Cc: stable@vger.kernel.org # 3.12+, counting backports Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-21fs: Teach path_connected to handle nfs filesystems with multiple roots.Eric W. Biederman
commit 95dd77580ccd66a0da96e6d4696945b8cea39431 upstream. On nfsv2 and nfsv3 the nfs server can export subsets of the same filesystem and report the same filesystem identifier, so that the nfs client can know they are the same filesystem. The subsets can be from disjoint directory trees. The nfsv2 and nfsv3 filesystems provides no way to find the common root of all directory trees exported form the server with the same filesystem identifier. The practical result is that in struct super s_root for nfs s_root is not necessarily the root of the filesystem. The nfs mount code sets s_root to the root of the first subset of the nfs filesystem that the kernel mounts. This effects the dcache invalidation code in generic_shutdown_super currently called shrunk_dcache_for_umount and that code for years has gone through an additional list of dentries that might be dentry trees that need to be freed to accomodate nfs. When I wrote path_connected I did not realize nfs was so special, and it's hueristic for avoiding calling is_subdir can fail. The practical case where this fails is when there is a move of a directory from the subtree exposed by one nfs mount to the subtree exposed by another nfs mount. This move can happen either locally or remotely. With the remote case requiring that the move directory be cached before the move and that after the move someone walks the path to where the move directory now exists and in so doing causes the already cached directory to be moved in the dcache through the magic of d_splice_alias. If someone whose working directory is in the move directory or a subdirectory and now starts calling .. from the initial mount of nfs (where s_root == mnt_root), then path_connected as a heuristic will not bother with the is_subdir check. As s_root really is not the root of the nfs filesystem this heuristic is wrong, and the path may actually not be connected and path_connected can fail. The is_subdir function might be cheap enough that we can call it unconditionally. Verifying that will take some benchmarking and the result may not be the same on all kernels this fix needs to be backported to. So I am avoiding that for now. Filesystems with snapshots such as nilfs and btrfs do something similar. But as the directory tree of the snapshots are disjoint from one another and from the main directory tree rename won't move things between them and this problem will not occur. Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19userns: Don't fail follow_automount based on s_user_nsEric W. Biederman
[ Upstream commit bbc3e471011417598e598707486f5d8814ec9c01 ] When vfs_submount was added the test to limit automounts from filesystems that with s_user_ns != &init_user_ns accidentially left in follow_automount. The test was never about any security concerns and was always about how do we implement this for filesystems whose s_user_ns != &init_user_ns. At the moment this check makes no difference as there are no filesystems that both set FS_USERNS_MOUNT and implement d_automount. Remove this check now while I am thinking about it so there will not be odd booby traps for someone who does want to make this combination work. vfs_submount still needs improvements to allow this combination to work, and vfs_submount contains a check that presents a warning. The autofs4 filesystem could be modified to set FS_USERNS_MOUNT and it would need not work on this code path, as userspace performs the mounts. Fixes: 93faccbbfa95 ("fs: Better permission checking for submounts") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19Revert "btrfs: use proper endianness accessors for super_copy"Greg Kroah-Hartman
This reverts commit 3c181c12c431fe33b669410d663beb9cceefcd1b as it causes breakage on big endian systems with btrfs images. Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Cc: Anand Jain <anand.jain@oracle.com> Cc: Liu Bo <bo.li.liu@oracle.com> Cc: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15NFS: Fix unstable write completionTrond Myklebust
commit c4f24df942a181699c5bab01b8e5e82b925f77f3 upstream. We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to ensure that all outstanding COMMIT requests to the inode in question are complete. Currently we may exit early from both nfs_commit_inode() and nfs_write_inode() even if there are COMMIT requests in flight, or unstable writes on the commit list. In order to get the right semantics w.r.t. sync_inode(), we don't need to have nfs_commit_inode() reset the inode dirty flags when called from nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that nfs_write_inode() leaves them in the right state if there are outstanding commits, or stable pages. Reported-by: Scott Mayhew <smayhew@redhat.com> Fixes: dc4fd9ab01ab ("nfs: don't wait on commit in nfs_commit_inode()...") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15pNFS: Prevent the layout header refcount going to zero in pnfs_roc()Trond Myklebust
commit 9c6376ebddad585da4238532dd6d90ae23ffee67 upstream. Ensure that we hold a reference to the layout header when processing the pNFS return-on-close so that the refcount value does not inadvertently go to zero. Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.10+ Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15NFS: Fix an incorrect type in struct nfs_direct_reqTrond Myklebust
commit d9ee65539d3eabd9ade46cca1780e3309ad0f907 upstream. The start offset needs to be of type loff_t. Fixed: 5fadeb47dcc5c ("nfs: count DIO good bytes correctly with mirroring") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-08direct-io: Fix sleep in atomic due to sync AIOJan Kara
commit d9c10e5b8863cfb6886d1640386455075c6e979d upstream. Commit e864f39569f4 "fs: add RWF_DSYNC aand RWF_SYNC" added additional way for direct IO to become synchronous and thus trigger fsync from the IO completion handler. Then commit 9830f4be159b "fs: Use RWF_* flags for AIO operations" allowed these flags to be set for AIO as well. However that commit forgot to update the condition checking whether the IO completion handling should be defered to a workqueue and thus AIO DIO with RWF_[D]SYNC set will call fsync() from IRQ context resulting in sleep in atomic. Fix the problem by checking directly iocb flags (the same way as it is done in dio_complete()) instead of checking all conditions that could lead to IO being synchronous. CC: Christoph Hellwig <hch@lst.de> CC: Goldwyn Rodrigues <rgoldwyn@suse.com> CC: stable@vger.kernel.org Reported-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Fixes: 9830f4be159b29399d107bffb99e0132bc5aedd4 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-08btrfs: use proper endianness accessors for super_copyAnand Jain
commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream. The fs_info::super_copy is a byte copy of the on-disk structure and all members must use the accessor macros/functions to obtain the right value. This was missing in update_super_roots and in sysfs readers. Moving between opposite endianness hosts will report bogus numbers in sysfs, and mount may fail as the root will not be restored correctly. If the filesystem is always used on a same endian host, this will not be a problem. Fix this by using the btrfs_set_super...() functions to set fs_info::super_copy values, and for the sysfs, use the cached fs_info::nodesize/sectorsize values. CC: stable@vger.kernel.org Fixes: df93589a17378 ("btrfs: export more from FS_INFO to sysfs") Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03xfs: quota: check result of register_shrinker()Aliaksei Karaliou
[ Upstream commit 3a3882ff26fbdbaf5f7e13f6a0bccfbf7121041d ] xfs_qm_init_quotainfo() does not check result of register_shrinker() which was tagged as __must_check recently, reported by sparse. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> [darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03xfs: quota: fix missed destroy of qi_tree_lockAliaksei Karaliou
[ Upstream commit 2196881566225f3c3428d1a5f847a992944daa5b ] xfs_qm_destroy_quotainfo() does not destroy quotainfo->qi_tree_lock while destroys quotainfo->qi_quotaofflock. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03btrfs: Fix flush bio leakNikolay Borisov
[ Upstream commit beed9263f4000c48a5c48912f26576f6fa091181 ] Commit e0ae99941423 ("btrfs: preallocate device flush bio") reworked the way the flush bio is allocated and used. Concretely it allocates the bio in __alloc_device and then re-uses it multiple times with a very simple endio routine that just calls complete() without consuming a reference. Allocated bios by default come with a ref count of 1, which is then consumed by the endio routine (or not, in which case they should be bio_put by the caller). The way the impleementation works now is that the flush bio has a refcount of 2 and we only ever bio_put it once, leaving it to hang indefinitely. Fix this by removing the extra bio_get in __alloc_device. Fixes: e0ae99941423 ("btrfs: preallocate device flush bio") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03afs: Fix missing error handling in afs_write_end()David Howells
[ Upstream commit afae457d874860a7e299d334f59eede5f3ad4b47 ] afs_write_end() is missing page unlock and put if afs_fill_page() fails. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03sget(): handle failures of register_shrinker()Al Viro
[ Upstream commit 9ee332d99e4d5a97548943b81c54668450ce641b ] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03exec: avoid gcc-8 warning for get_task_commArnd Bergmann
[ Upstream commit 3756f6401c302617c5e091081ca4d26ab604bec5 ] gcc-8 warns about using strncpy() with the source size as the limit: fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess] This is indeed slightly suspicious, as it protects us from source arguments without NUL-termination, but does not guarantee that the destination is terminated. This keeps the strncpy() to ensure we have properly padded target buffer, but ensures that we use the correct length, by passing the actual length of the destination buffer as well as adding a build-time check to ensure it is exactly TASK_COMM_LEN. There are only 23 callsites which I all reviewed to ensure this is currently the case. We could get away with doing only the check or passing the right length, but it doesn't hurt to do both. Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Kees Cook <keescook@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Serge Hallyn <serge@hallyn.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Aleksa Sarai <asarai@suse.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25btrfs: Fix possible off-by-one in btrfs_search_path_in_treeNikolay Borisov
[ Upstream commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 ] The name char array passed to btrfs_search_path_in_tree is of size BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes are in the range of [0, 4079]. Currently the code uses the define but this represents an off-by-one. Implications: Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be written to extra space, not some padding that could be provided by the allocator. btrfs-progs store the arguments on stack, but kernel does own copy of the ioctl buffer and the off-by-one overwrite does not affect userspace, but the ending 0 might be lost. Kernel ioctl buffer is allocated dynamically so we're overwriting somebody else's memory, and the ioctl is privileged if args.objectid is not 256. Which is in most cases, but resolving a subvolume stored in another directory will trigger that path. Before this patch the buffer was one byte larger, but then the -1 was not added. Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ added implications ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25Btrfs: disable FUA if mounted with nobarrierOmar Sandoval
[ Upstream commit 1b9e619c5bc8235cfba3dc4ced2fb0e3554a05d4 ] I was seeing disk flushes still happening when I mounted a Btrfs filesystem with nobarrier for testing. This is because we use FUA to write out the first super block, and on devices without FUA support, the block layer translates FUA to a flush. Even on devices supporting true FUA, using FUA when we asked for no barriers is surprising. Fixes: 387125fc722a8ed ("Btrfs: fix barrier flushes") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25btrfs: Fix quota reservation leak on preallocated filesJustin Maggard
[ Upstream commit b430b7751286b3acff2d324553c8cec4f1e87764 ] Commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to") changed the behavior of __btrfs_buffered_write() so that it first tries to get a data space reservation, and then skips the relatively expensive nocow check if the reservation succeeded. If we have quotas enabled, the data space reservation also includes a quota reservation. But in the rewrite case, the space has already been accounted for in qgroups. So btrfs_check_data_free_space() increases the quota reservation, but it never gets decreased when the data actually gets written and overwrites the pre-existing data. So we're left with both the qgroup and qgroup reservation accounting for the same space. This commit adds the missing btrfs_qgroup_free_data() call in the case of BTRFS_ORDERED_PREALLOC extents. Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to") Signed-off-by: Justin Maggard <jmaggard@netgear.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25dnotify: Handle errors from fsnotify_add_mark_locked() in fcntl_dirnotify()Jan Kara
commit b3a0066005821acdc0cdb092cb72587182ab583f upstream. fsnotify_add_mark_locked() can fail but we do not check its return value. This didn't matter before commit 9dd813c15b2c "fsnotify: Move mark list head from object into dedicated structure" as none of possible failures could happen for dnotify but after that commit -ENOMEM can be returned. Handle this error properly in fcntl_dirnotify() as otherwise we just hit BUG_ON(dn_mark->dn) in dnotify_free_mark(). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reported-by: syzkaller Fixes: 9dd813c15b2c101168808d4f5941a29985758973 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22ovl: hash directory inodes for fsnotifyAmir Goldstein
commit 31747eda41ef3c30c09c5c096b380bf54013746a upstream. fsnotify pins a watched directory inode in cache, but if directory dentry is released, new lookup will allocate a new dentry and a new inode. Directory events will be notified on the new inode, while fsnotify listener is watching the old pinned inode. Hash all directory inodes to reuse the pinned inode on lookup. Pure upper dirs are hashes by real upper inode, merge and lower dirs are hashed by real lower inode. The reference to lower inode was being held by the lower dentry object in the overlay dentry (oe->lowerstack[0]). Releasing the overlay dentry may drop lower inode refcount to zero. Add a refcount on behalf of the overlay inode to prevent that. As a by-product, hashing directory inodes also detects multiple redirected dirs to the same lower dir and uncovered redirected dir target on and returns -ESTALE on lookup. The reported issue dates back to initial version of overlayfs, but this patch depends on ovl_inode code that was introduced in kernel v4.13. Cc: <stable@vger.kernel.org> #v4.13 Reported-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix unexpected -EEXIST when creating new inodeLiu Bo
commit 900c9981680067573671ecc5cbfa7c5770be3a40 upstream. The highest objectid, which is assigned to new inode, is decided at the time of initializing fs roots. However, in cases where log replay gets processed, the btree which fs root owns might be changed, so we have to search it again for the highest objectid, otherwise creating new inode would end up with -EEXIST. cc: <stable@vger.kernel.org> v4.4-rc6+ Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix use-after-free on root->orphan_block_rsvLiu Bo
commit 1a932ef4e47984dee227834667b5ff5a334e4805 upstream. I got these from running generic/475, WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs] Call Trace: btrfs_orphan_release_metadata+0x9f/0x200 [btrfs] btrfs_orphan_del+0x10d/0x170 [btrfs] btrfs_setattr+0x500/0x640 [btrfs] notify_change+0x7ae/0x870 do_truncate+0xca/0x130 vfs_truncate+0x2ee/0x3d0 do_sys_truncate+0xaf/0xf0 SyS_truncate+0xe/0x10 entry_SYSCALL_64_fastpath+0x1f/0x96 The race is between btrfs_orphan_commit_root and btrfs_orphan_del, t1 t2 btrfs_orphan_commit_root btrfs_orphan_del spin_lock check (&root->orphan_inodes) root->orphan_block_rsv = NULL; spin_unlock atomic_dec(&root->orphan_inodes); access root->orphan_block_rsv Accessing root->orphan_block_rsv must be done before decreasing root->orphan_inodes. cc: <stable@vger.kernel.org> v3.12+ Fixes: 703c88e03524 ("Btrfs: fix tracking of orphan inode count") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctlyLiu Bo
commit e8f1bc1493855e32b7a2a019decc3c353d94daf6 upstream. This regression is introduced in commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction"). There are two problems, a) it is ->destroy_inode() that does the final free on inode, not ->evict_inode(), b) clear_inode() must be called before ->evict_inode() returns. This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); in evict() because I_CLEAR is set in clear_inode(). Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction") Cc: <stable@vger.kernel.org> # v4.7-rc6+ Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix extent state leak from tree logLiu Bo
commit 55237a5f2431a72435e3ed39e4306e973c0446b7 upstream. It's possible that btrfs_sync_log() bails out after one of the two btrfs_write_marked_extents() which convert extent state's state bit into EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY and EXTENT_NEW are searched by free_log_tree() so that those extent states with EXTENT_NEED_WAIT lead to memory leak. cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix crash due to not cleaning up tree log block's dirty bitsLiu Bo
commit 1846430c24d66e85cc58286b3319c82cd54debb2 upstream. In cases that the whole fs flips into readonly status due to failures in critical sections, then log tree's blocks are still dirty, and this leads to a crash during umount time, the crash is about use-after-free, umount -> close_ctree -> stop workers -> iput(btree_inode) -> iput_final -> write_inode_now -> ... -> queue job on stop'd workers cc: <stable@vger.kernel.org> v3.12+ Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix deadlock in run_delalloc_nocowLiu Bo
commit e89166990f11c3f21e1649d760dd35f9e410321c upstream. @cur_offset is not set back to what it should be (@cow_start) if btrfs_next_leaf() returns something wrong, and the range [cow_start, cur_offset) remains locked forever. cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22seq_file: fix incomplete reset on read from zero offsetMiklos Szeredi
commit cf5eebae2cd28d37581507668605f4d23cd7218d upstream. When resetting iterator on a zero offset we need to discard any data already in the buffer (count), and private state of the iterator (version). For example this bug results in first line being repeated in /proc/mounts if doing a zero size read before a non-zero size read. Reported-by: Rich Felker <dalias@libc.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: e522751d605d ("seq_file: reset iterator to first record for zero offset") Cc: <stable@vger.kernel.org> # v4.10 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22ext4: save error to disk in __ext4_grp_locked_error()Zhouyi Zhou
commit 06f29cc81f0350261f59643a505010531130eea0 upstream. In the function __ext4_grp_locked_error(), __save_error_info() is called to save error info in super block block, but does not sync that information to disk to info the subsequence fsck after reboot. This patch writes the error information to disk. After this patch, I think there is no obvious EXT4 error handle branches which leads to "Remounting filesystem read-only" will leave the disk partition miss the subsequence fsck. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22ext4: fix a race in the ext4 shutdown pathHarshad Shirwadkar
commit abbc3f9395c76d554a9ed27d4b1ebfb5d9b0e4ca upstream. This patch fixes a race between the shutdown path and bio completion handling. In the ext4 direct io path with async io, after submitting a bio to the block layer, if journal starting fails, ext4_direct_IO_write() would bail out pretending that the IO failed. The caller would have had no way of knowing whether or not the IO was successfully submitted. So instead, we return -EIOCBQUEUED in this case. Now, the caller knows that the IO was submitted. The bio completion handler takes care of the error. Tested: Ran the shutdown xfstest test 461 in loop for over 2 hours across 4 machines resulting in over 400 runs. Verified that the race didn't occur. Usually the race was seen in about 20-30 iterations. Signed-off-by: Harshad Shirwadkar <harshads@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22jbd2: fix sphinx kernel-doc build warningsTobin C. Harding
commit f69120ce6c024aa634a8fc25787205e42f0ccbe6 upstream. Sphinx emits various (26) warnings when building make target 'htmldocs'. Currently struct definitions contain duplicate documentation, some as kernel-docs and some as standard c89 comments. We can reduce duplication while cleaning up the kernel docs. Move all kernel-docs to right above each struct member. Use the set of all existing comments (kernel-doc and c89). Add documentation for missing struct members and function arguments. Signed-off-by: Tobin C. Harding <me@tobin.cc> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22mbcache: initialize entry->e_referenced in mb_cache_entry_create()Alexander Potapenko
commit 3876bbe27d04b848750d5310a37d6b76b593f648 upstream. KMSAN reported use of uninitialized |entry->e_referenced| in a condition in mb_cache_shrink(): ================================================================== BUG: KMSAN: use of uninitialized memory in mb_cache_shrink+0x3b4/0xc50 fs/mbcache.c:287 CPU: 2 PID: 816 Comm: kswapd1 Not tainted 4.11.0-rc5+ #2877 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x172/0x1c0 lib/dump_stack.c:52 kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:927 __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:469 mb_cache_shrink+0x3b4/0xc50 fs/mbcache.c:287 mb_cache_scan+0x67/0x80 fs/mbcache.c:321 do_shrink_slab mm/vmscan.c:397 [inline] shrink_slab+0xc3d/0x12d0 mm/vmscan.c:500 shrink_node+0x208f/0x2fd0 mm/vmscan.c:2603 kswapd_shrink_node mm/vmscan.c:3172 [inline] balance_pgdat mm/vmscan.c:3289 [inline] kswapd+0x160f/0x2850 mm/vmscan.c:3478 kthread+0x46c/0x5f0 kernel/kthread.c:230 ret_from_fork+0x29/0x40 arch/x86/entry/entry_64.S:430 chained origin: save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 [inline] kmsan_save_stack mm/kmsan/kmsan.c:317 [inline] kmsan_internal_chain_origin+0x12a/0x1f0 mm/kmsan/kmsan.c:547 __msan_store_shadow_origin_1+0xac/0x110 mm/kmsan/kmsan_instr.c:257 mb_cache_entry_create+0x3b3/0xc60 fs/mbcache.c:95 ext4_xattr_cache_insert fs/ext4/xattr.c:1647 [inline] ext4_xattr_block_set+0x4c82/0x5530 fs/ext4/xattr.c:1022 ext4_xattr_set_handle+0x1332/0x20a0 fs/ext4/xattr.c:1252 ext4_xattr_set+0x4d2/0x680 fs/ext4/xattr.c:1306 ext4_xattr_trusted_set+0x8d/0xa0 fs/ext4/xattr_trusted.c:36 __vfs_setxattr+0x703/0x790 fs/xattr.c:149 __vfs_setxattr_noperm+0x27a/0x6f0 fs/xattr.c:180 vfs_setxattr fs/xattr.c:223 [inline] setxattr+0x6ae/0x790 fs/xattr.c:449 path_setxattr+0x1eb/0x380 fs/xattr.c:468 SYSC_lsetxattr+0x8d/0xb0 fs/xattr.c:490 SyS_lsetxattr+0x77/0xa0 fs/xattr.c:486 entry_SYSCALL_64_fastpath+0x13/0x94 origin: save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 [inline] kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:198 kmsan_kmalloc+0x7f/0xe0 mm/kmsan/kmsan.c:337 kmem_cache_alloc+0x1c2/0x1e0 mm/slub.c:2766 mb_cache_entry_create+0x283/0xc60 fs/mbcache.c:86 ext4_xattr_cache_insert fs/ext4/xattr.c:1647 [inline] ext4_xattr_block_set+0x4c82/0x5530 fs/ext4/xattr.c:1022 ext4_xattr_set_handle+0x1332/0x20a0 fs/ext4/xattr.c:1252 ext4_xattr_set+0x4d2/0x680 fs/ext4/xattr.c:1306 ext4_xattr_trusted_set+0x8d/0xa0 fs/ext4/xattr_trusted.c:36 __vfs_setxattr+0x703/0x790 fs/xattr.c:149 __vfs_setxattr_noperm+0x27a/0x6f0 fs/xattr.c:180 vfs_setxattr fs/xattr.c:223 [inline] setxattr+0x6ae/0x790 fs/xattr.c:449 path_setxattr+0x1eb/0x380 fs/xattr.c:468 SYSC_lsetxattr+0x8d/0xb0 fs/xattr.c:490 SyS_lsetxattr+0x77/0xa0 fs/xattr.c:486 entry_SYSCALL_64_fastpath+0x13/0x94 ================================================================== Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: stable@vger.kernel.org # v4.6 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22kmemcheck: remove annotationsLevin, Alexander (Sasha Levin)
commit 4950276672fce5c241857540f8561c440663673d upstream. Patch series "kmemcheck: kill kmemcheck", v2. As discussed at LSF/MM, kill kmemcheck. KASan is a replacement that is able to work without the limitation of kmemcheck (single CPU, slow). KASan is already upstream. We are also not aware of any users of kmemcheck (or users who don't consider KASan as a suitable replacement). The only objection was that since KASAN wasn't supported by all GCC versions provided by distros at that time we should hold off for 2 years, and try again. Now that 2 years have passed, and all distros provide gcc that supports KASAN, kill kmemcheck again for the very same reasons. This patch (of 4): Remove kmemcheck annotations, and calls to kmemcheck from the kernel. [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs] Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGEGang He
commit ff26cc10aec128c3f86b5611fd5f59c71d49c0e3 upstream. If we can't get inode lock immediately in the function ocfs2_inode_lock_with_page() when reading a page, we should not return directly here, since this will lead to a softlockup problem when the kernel is configured with CONFIG_PREEMPT is not set. The method is to get a blocking lock and immediately unlock before returning, this can avoid CPU resource waste due to lots of retries, and benefits fairness in getting lock among multiple nodes, increase efficiency in case modifying the same file frequently from multiple nodes. The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1) looks like: Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <IRQ> dump_stack+0x5c/0x82 panic+0xd5/0x21e watchdog_timer_fn+0x208/0x210 __hrtimer_run_queues+0xcc/0x200 hrtimer_interrupt+0xa6/0x1f0 smp_apic_timer_interrupt+0x34/0x50 apic_timer_interrupt+0x96/0xa0 </IRQ> RIP: 0010:unlock_page+0x17/0x30 RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004 RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300 RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00 R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518 R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300 ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2] ocfs2_readpage+0x41/0x2d0 [ocfs2] filemap_fault+0x12b/0x5c0 ocfs2_fault+0x29/0xb0 [ocfs2] __do_fault+0x1a/0xa0 __handle_mm_fault+0xbe8/0x1090 handle_mm_fault+0xaa/0x1f0 __do_page_fault+0x235/0x4b0 trace_do_page_fault+0x3c/0x110 async_page_fault+0x28/0x30 RIP: 0033:0x7fa75ded638e RSP: 002b:00007ffd6657db18 EFLAGS: 00010287 RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700 RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700 RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000 R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770 R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000 About performance improvement, we can see the testing time is reduced, and CPU utilization decreases, the detailed data is as follows. I ran multi_mmap test case in ocfs2-test package in a three nodes cluster. Before applying this patch: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap 1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync 5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0 95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1 2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33 2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4 2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared Tests with "-b 4096 -C 32768" Thu Dec 28 14:44:52 CST 2017 multi_mmap..................................................Passed. Runtime 783 seconds. After apply this patch: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap 155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3 95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1 2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun 5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0 2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33 299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H 335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H 535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged 1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared Tests with "-b 4096 -C 32768" Thu Dec 28 15:04:12 CST 2017 multi_mmap..................................................Passed. Runtime 487 seconds. Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock") Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Eric Ren <zren@suse.com> Acked-by: alex chen <alex.chen@huawei.com> Acked-by: piaojun <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16devpts: fix error handling in devpts_mntget()Eric Biggers
commit c9cc8d01fb04117928830449388512a5047569c9 upstream. If devpts_ptmx_path() returns an error code, then devpts_mntget() dereferences an ERR_PTR(): BUG: unable to handle kernel paging request at fffffffffffffff5 IP: devpts_mntget+0x13f/0x280 fs/devpts/inode.c:173 Fix it by returning early in the error paths. Reproducer: #define _GNU_SOURCE #include <fcntl.h> #include <sched.h> #include <sys/ioctl.h> #define TIOCGPTPEER _IO('T', 0x41) int main() { for (;;) { int fd = open("/dev/ptmx", 0); unshare(CLONE_NEWNS); ioctl(fd, TIOCGPTPEER, 0); } } Fixes: 311fc65c9fb9 ("pty: Repair TIOCGPTPEER") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16ovl: take mnt_want_write() for removing impure xattrAmir Goldstein
commit a5a927a7c82e28ea76599dee4019c41e372c911f upstream. The optimization in ovl_cache_get_impure() that tries to remove an unneeded "impure" xattr needs to take mnt_want_write() on upper fs. Fixes: 4edb83bb1041 ("ovl: constant d_ino for non-merge dirs") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16ovl: fix failure to fsync lower dirAmir Goldstein
commit d796e77f1dd541fe34481af2eee6454688d13982 upstream. As a writable mount, it is not expected for overlayfs to return EINVAL/EROFS for fsync, even if dir/file is not changed. This commit fixes the case of fsync of directory, which is easier to address, because overlayfs already implements fsync file operation for directories. The problem reported by Raphael is that new PostgreSQL 10.0 with a database in overlayfs where lower layer in squashfs fails to start. The failure is due to fsync error, when PostgreSQL does fsync on all existing db directories on startup and a specific directory exists lower layer with no changes. Reported-by: Raphael Hertzog <raphael@ouaza.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Tested-by: Raphaƫl Hertzog <hertzog@debian.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16btrfs: Handle btrfs_set_extent_delalloc failure in fixup workerNikolay Borisov
commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream. This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers to deal with pages that have been improperly dirtied") and it didn't do any error handling then. This function might very well fail in ENOMEM situation, yet it's not handled, this could lead to inconsistent state. So let's handle the failure by setting the mapping error bit. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>