aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs/super.c
AgeCommit message (Collapse)Author
2020-09-17btrfs: fix messages after changing compression level by remountDavid Sterba
commit 27942c9971cc405c60432eca9395e514a2ae9f5e upstream. Reported by Forza on IRC that remounting with compression options does not reflect the change in level, or at least it does not appear to do so according to the messages: mount -o compress=zstd:1 /dev/sda /mnt mount -o remount,compress=zstd:15 /mnt does not print the change to the level to syslog: [ 41.366060] BTRFS info (device vda): use zstd compression, level 1 [ 41.368254] BTRFS info (device vda): disk space caching is enabled [ 41.390429] BTRFS info (device vda): disk space caching is enabled What really happens is that the message is lost but the level is actualy changed. There's another weird output, if compression is reset to 'no': [ 45.413776] BTRFS info (device vda): use no compression, level 4 To fix that, save the previous compression level and print the message in that case too and use separate message for 'no' compression. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-09-17btrfs: don't traverse into the seed devices in show_devnameAnand Jain
commit 4faf55b03823e96c44dc4e364520000ed3b12fdb upstream. ->show_devname currently shows the lowest devid in the list. As the seed devices have the lowest devid in the sprouted filesystem, the userland tool such as findmnt end up seeing seed device instead of the device from the read-writable sprouted filesystem. As shown below. mount /dev/sda /btrfs mount: /btrfs: WARNING: device write-protected, mounted read-only. findmnt --output SOURCE,TARGET,UUID /btrfs SOURCE TARGET UUID /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111 btrfs dev add -f /dev/sdb /btrfs umount /btrfs mount /dev/sdb /btrfs findmnt --output SOURCE,TARGET,UUID /btrfs SOURCE TARGET UUID /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111 All sprouts from a single seed will show the same seed device and the same fsid. That's confusing. This is causing problems in our prototype as there isn't any reference to the sprout file-system(s) which is being used for actual read and write. This was added in the patch which implemented the show_devname in btrfs commit 9c5085c14798 ("Btrfs: implement ->show_devname"). I tried to look for any particular reason that we need to show the seed device, there isn't any. So instead, do not traverse through the seed devices, just show the lowest devid in the sprouted fsid. After the patch: mount /dev/sda /btrfs mount: /btrfs: WARNING: device write-protected, mounted read-only. findmnt --output SOURCE,TARGET,UUID /btrfs SOURCE TARGET UUID /dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111 btrfs dev add -f /dev/sdb /btrfs mount -o rw,remount /dev/sdb /btrfs findmnt --output SOURCE,TARGET,UUID /btrfs SOURCE TARGET UUID /dev/sdb /btrfs 595ca0e6-b82e-46b5-b9e2-c72a6928be48 mount /dev/sda /btrfs1 mount: /btrfs1: WARNING: device write-protected, mounted read-only. btrfs dev add -f /dev/sdc /btrfs1 findmnt --output SOURCE,TARGET,UUID /btrfs1 SOURCE TARGET UUID /dev/sdc /btrfs1 ca1dbb7a-8446-4f95-853c-a20f3f82bdbb cat /proc/self/mounts | grep btrfs /dev/sdb /btrfs btrfs rw,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0 /dev/sdc /btrfs1 btrfs ro,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0 Reported-by: Martin K. Petersen <martin.petersen@oracle.com> CC: stable@vger.kernel.org # 4.19+ Tested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-05-04btrfs: log message when rw remount is attempted with unclean tree-logDavid Sterba
commit 10a3a3edc5b89a8cd095bc63495fb1e0f42047d9 upstream. A remount to a read-write filesystem is not safe when there's tree-log to be replayed. Files that could be opened until now might be affected by the changes in the tree-log. A regular mount is needed to replay the log so the filesystem presents the consistent view with the pending changes included. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2020-04-16btrfs: do not zero f_bavail if we have available spaceJosef Bacik
commit d55966c4279bfc6a0cf0b32bf13f5df228a1eeb6 upstream. There was some logic added a while ago to clear out f_bavail in statfs() if we did not have enough free metadata space to satisfy our global reserve. This was incorrect at the time, however didn't really pose a problem for normal file systems because we would often allocate chunks if we got this low on free metadata space, and thus wouldn't really hit this case unless we were actually full. Fast forward to today and now we are much better about not allocating metadata chunks all of the time. Couple this with d792b0f19711 ("btrfs: always reserve our entire size for the global reserve") which now means we'll easily have a larger global reserve than our free space, we are now more likely to trip over this while still having plenty of space. Fix this by skipping this logic if the global rsv's space_info is not full. space_info->full is 0 unless we've attempted to allocate a chunk for that space_info and that has failed. If this happens then the space for the global reserve is definitely sacred and we need to report b_avail == 0, but before then we can just use our calculated b_avail. Reported-by: Martin Steigerwald <martin@lichtvoll.de> Fixes: ca8a51b3a979 ("btrfs: statfs: report zero available if metadata are exhausted") CC: stable@vger.kernel.org # 4.5+ Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-By: Martin Steigerwald <martin@lichtvoll.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2019-05-07Merge tag 'for-5.2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This time the majority of changes are cleanups, though there's still a number of changes of user interest. User visible changes: - better read time and write checks to catch errors early and before writing data to disk (to catch potential memory corruption on data that get checksummed) - qgroups + metadata relocation: last speed up patch int the series to address the slowness, there should be no overhead comparing balance with and without qgroups - FIEMAP ioctl does not start a transaction unnecessarily, this can result in a speed up and less blocking due to IO - LOGICAL_INO (v1, v2) does not start transaction unnecessarily, this can speed up the mentioned ioctl and scrub as well - fsync on files with many (but not too many) hardlinks is faster, finer decision if the links should be fsynced individually or completely - send tries harder to find ranges to clone - trim/discard will skip unallocated chunks that haven't been touched since the last mount Fixes: - send flushes delayed allocation before start, otherwise it could miss some changes in case of a very recent rw->ro switch of a subvolume - fix fallocate with qgroups that could lead to space accounting underflow, reported as a warning - trim/discard ioctl honours the requested range - starting send and dedupe on a subvolume at the same time will let only one of them succeed, this is to prevent changes that send could miss due to dedupe; both operations are restartable Core changes: - more tree-checker validations, errors reported by fuzzing tools: - device item - inode item - block group profiles - tracepoints for extent buffer locking - async cow preallocates memory to avoid errors happening too deep in the call chain - metadata reservations for delalloc reworked to better adapt in many-writers/low-space scenarios - improved space flushing logic for intense DIO vs buffered workloads - lots of cleanups - removed unused struct members - redundant argument removal - properties and xattrs - extent buffer locking - selftests - use common file type conversions - many-argument functions reduction" * tag 'for-5.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (227 commits) btrfs: Use kvmalloc for allocating compressed path context btrfs: Factor out common extent locking code in submit_compressed_extents btrfs: Set io_tree only once in submit_compressed_extents btrfs: Replace clear_extent_bit with unlock_extent btrfs: Make compress_file_range take only struct async_chunk btrfs: Remove fs_info from struct async_chunk btrfs: Rename async_cow to async_chunk btrfs: Preallocate chunks in cow_file_range_async btrfs: reserve delalloc metadata differently btrfs: track DIO bytes in flight btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_prop btrfs: delete unused function btrfs_set_prop_trans btrfs: start transaction in xattr_handler_set_prop btrfs: drop local copy of inode i_mode btrfs: drop old_fsflags in btrfs_ioctl_setflags btrfs: modify local copy of btrfs_inode flags btrfs: drop useless inode i_flags copy and restore btrfs: start transaction in btrfs_ioctl_setflags() btrfs: export btrfs_set_prop btrfs: refactor btrfs_set_props to validate externally ...
2019-05-01btrfs: use ->free_inode()Al Viro
a lot of stuff remains in ->destroy_inode() Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-29btrfs: drop unused parameter in mount_subvolAnand Jain
@device_name in mount_subvol() is not used, drop it. Also see: 5bedc48a8f9e ("btrfs: drop unused parameters from mount_subvol"). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: add zstd compression level supportDennis Zhou
Zstd compression requires different amounts of memory for each level of compression. The prior patches implemented indirection to allow for each compression type to manage their workspaces independently. This patch uses this indirection to implement compression level support for zstd. To manage the additional memory require, each compression level has its own queue of workspaces. A global LRU is used to help with reclaim. Reclaim is done via a timer which provides a mechanism to decrease memory utilization by keeping only workspaces around that are sized appropriately. Forward progress is guaranteed by a preallocated max workspace hidden from the LRU. When getting a workspace, it uses a bitmap to identify the levels that are populated and scans up. If it finds a workspace that is greater than it, it uses it, but does not update the last_used time and the corresponding place in the LRU. If we hit memory pressure, we sleep on the max level workspace. We continue to rescan in case we can use a smaller workspace, but eventually should be able to obtain the max level workspace or allocate one again should memory pressure subside. The memory requirement for decompression is the same as level 1, and therefore can use any of available workspace. The number of workspaces is bound by an upper limit of the workqueue's limit which currently is 2 (percpu limit). The reclaim timer is used to free inactive/improperly sized workspaces and is set to 307s to avoid colliding with transaction commit (every 30s). Repeating the experiment from v2 [1], the Silesia corpus was copied to a btrfs filesystem 10 times and then read back after dropping the caches. The btrfs filesystem was on an SSD. Level Ratio Compression (MB/s) Decompression (MB/s) Memory (KB) 1 2.658 438.47 910.51 780 2 2.744 364.86 886.55 1004 3 2.801 336.33 828.41 1260 4 2.858 286.71 886.55 1260 5 2.916 212.77 556.84 1388 6 2.363 119.82 990.85 1516 7 3.000 154.06 849.30 1516 8 3.011 159.54 875.03 1772 9 3.025 100.51 940.15 1772 10 3.033 118.97 616.26 1772 11 3.036 94.19 802.11 1772 12 3.037 73.45 931.49 1772 13 3.041 55.17 835.26 2284 14 3.087 44.70 716.78 2547 15 3.126 37.30 878.84 2547 [1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terrelln@fb.com/ Cc: Nick Terrell <terrelln@fb.com> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: change set_level() to bound the level passed inDennis Zhou
Currently, the only user of set_level() is zlib which sets an internal workspace parameter. As level is now plumbed into get_workspace(), this can be handled there rather than separately. This repurposes set_level() to bound the level passed in so it can be used when setting the mounts compression level and as well as verifying the level before getting a workspace. The other benefit is this divides the meaning of compress(0) and get_workspace(0). The former means we want to use the default compression level of the compression type. The latter means we can use any workspace available. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: introduce new ioctl to unregister a btrfs deviceAnand Jain
Support for a new command that can be used eg. as a command $ btrfs device scan --forget [dev]' (the final name may change though) to undo the effects of 'btrfs device scan [dev]'. For this purpose this patch proposes to use ioctl #5 as it was empty and is next to the SCAN ioctl. The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device (/dev/btrfs-control) to unregister one or all devices, devices that are not mounted. The argument is struct btrfs_ioctl_vol_args, ::name specifies the device path. To unregister all device, the path is an empty string. Again, the devices are removed only if they aren't part of a mounte filesystem. This new ioctl provides: - release of unwanted btrfs_fs_devices and btrfs_devices structures from memory if the device is not going to be mounted - ability to mount filesystem in degraded mode, when one devices is corrupted like in split brain raid1 - running test cases which would require reloading the kernel module but this is not possible eg. due to mounted filesystem or built-in Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-03Merge tag 'for-5.0-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - regression fix: transaction commit can run away due to delayed ref waiting heuristic, this is not necessary now because of the proper reservation mechanism introduced in 5.0 - regression fix: potential crash due to use-before-check of an ERR_PTR return value - fix for transaction abort during transaction commit that needs to properly clean up pending block groups - fix deadlock during b-tree node/leaf splitting, when this happens on some of the fundamental trees, we must prevent new tree block allocation to re-enter indirectly via the block group flushing path - potential memory leak after errors during mount * tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: On error always free subvol_name in btrfs_mount btrfs: clean up pending block groups when transaction commit aborts btrfs: fix potential oops in device_list_add btrfs: don't end the transaction for delayed refs in throttle Btrfs: fix deadlock when allocating tree block during leaf/node split
2019-01-30btrfs: On error always free subvol_name in btrfs_mountEric W. Biederman
The subvol_name is allocated in btrfs_parse_subvol_options and is consumed and freed in mount_subvol. Add a free to the error paths that don't call mount_subvol so that it is guaranteed that subvol_name is freed when an error happens. Fixes: 312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()") Cc: stable@vger.kernel.org # v4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-05Merge branch 'mount.part1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount API prep from Al Viro: "Mount API prereqs. Mostly that's LSM mount options cleanups. There are several minor fixes in there, but nothing earth-shattering (leaks on failure exits, mostly)" * 'mount.part1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (27 commits) mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNT smack: rewrite smack_sb_eat_lsm_opts() smack: get rid of match_token() smack: take the guts of smack_parse_opts_str() into a new helper LSM: new method: ->sb_add_mnt_opt() selinux: rewrite selinux_sb_eat_lsm_opts() selinux: regularize Opt_... names a bit selinux: switch away from match_token() selinux: new helper - selinux_add_opt() LSM: bury struct security_mnt_opts smack: switch to private smack_mnt_opts selinux: switch to private struct selinux_mnt_opts LSM: hide struct security_mnt_opts from any generic code selinux: kill selinux_sb_get_mnt_opts() LSM: turn sb_eat_lsm_opts() into a method nfs_remount(): don't leak, don't ignore LSM options quietly btrfs: sanitize security_mnt_opts use selinux; don't open-code a loop in sb_finish_set_opts() LSM: split ->sb_set_mnt_opts() out of ->sb_kern_mount() new helper: security_sb_eat_lsm_opts() ...
2018-12-21LSM: hide struct security_mnt_opts from any generic codeAl Viro
Keep void * instead, allocate on demand (in parse_str_opts, at the moment). Eventually both selinux and smack will be better off with private structures with several strings in those, rather than this "counter and two pointers to dynamically allocated arrays" ugliness. This commit allows to do that at leisure, without disrupting anything outside of given module. Changes: * instead of struct security_mnt_opt use an opaque pointer initialized to NULL. * security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and security_free_mnt_opts() take it as var argument (i.e. as void **); call sites are unchanged. * security_sb_set_mnt_opts() and security_sb_remount() take it by value (i.e. as void *). * new method: ->sb_free_mnt_opts(). Takes void *, does whatever freeing that needs to be done. * ->sb_set_mnt_opts() and ->sb_remount() might get NULL as mnt_opts argument, meaning "empty". Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21btrfs: sanitize security_mnt_opts useAl Viro
1) keeping a copy in btrfs_fs_info is completely pointless - we never use it for anything. Getting rid of that allows for simpler calling conventions for setup_security_options() (caller is responsible for freeing mnt_opts in all cases). 2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(), same as we would if not for FS_BINARY_MOUNTDATA. Behaviours *are* close (in fact, selinux sb_set_mnt_opts() ought to punt to sb_remount() in "already initialized" case), but let's handle that uniformly. And the only reason why the original btrfs changes didn't go for security_sb_remount() in btrfs_remount() case is that it hadn't been exported. Let's export it for a while - it'll be going away soon anyway. Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21new helper: security_sb_eat_lsm_opts()Al Viro
combination of alloc_secdata(), security_sb_copy_data(), security_sb_parse_opt_str() and free_secdata(). Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-17btrfs: Fix typos in comments and stringsAndrea Gelmini
The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove fsid/metadata_fsid fields from btrfs_infoNikolay Borisov
Currently btrfs_fs_info structure contains a copy of the fsid/metadata_uuid fields. Same values are also contained in the btrfs_fs_devices structure which fs_info has a reference to. Let's reduce duplication by removing the fields from fs_info and always refer to the ones in fs_devices. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-14Btrfs: ensure path name is null terminated at btrfs_control_ioctlFilipe Manana
We were using the path name received from user space without checking that it is null terminated. While btrfs-progs is well behaved and does proper validation and null termination, someone could call the ioctl and pass a non-null terminated patch, leading to buffer overrun problems in the kernel. The ioctl is protected by CAP_SYS_ADMIN. So just set the last byte of the path to a null character, similar to what we do in other ioctls (add/remove/resize device, snapshot creation, etc). CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06btrfs: avoid link error with CONFIG_NO_AUTO_INLINEArnd Bergmann
Note: this patch fixes a problem in a feature outside of btrfs ("kernel hacking: add a config option to disable compiler auto-inlining") and is applied ahead of time due to cross-subsystem dependencies. On 32-bit ARM with gcc-8, I see a link error with the addition of the CONFIG_NO_AUTO_INLINE option: fs/btrfs/super.o: In function `btrfs_statfs': super.c:(.text+0x67b8): undefined reference to `__aeabi_uldivmod' super.c:(.text+0x67fc): undefined reference to `__aeabi_uldivmod' super.c:(.text+0x6858): undefined reference to `__aeabi_uldivmod' super.c:(.text+0x6920): undefined reference to `__aeabi_uldivmod' super.c:(.text+0x693c): undefined reference to `__aeabi_uldivmod' fs/btrfs/super.o:super.c:(.text+0x6958): more undefined references to `__aeabi_uldivmod' follow So far this is the only file that shows the behavior, so I'd propose to just work around it by marking the functions as 'static inline' that normally get inlined here. The reference to __aeabi_uldivmod comes from a div_u64() which has an optimization for a constant division that uses a straight '/' operator when the result should be known to the compiler. My interpretation is that as we turn off inlining, gcc still expects the result to be constant but fails to use that constant value. Link: https://lkml.kernel.org/r/20181103153941.1881966-1-arnd@arndb.de Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> [ add the note ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: Remove 'objectid' member from struct btrfs_rootMisono Tomohiro
There are two members in struct btrfs_root which indicate root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root(): static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { ... root->objectid = objectid; ... root->root_key.objectid = objecitd; ... } and not changed to other value after initialization. grep in btrfs directory shows both are used in many places: $ grep -rI "root->root_key.objectid" | wc -l 133 $ grep -rI "root->objectid" | wc -l 55 (4.17, inc. some noise) It is confusing to have two similar variable names and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for tree reloc tree, let's remove 'objecitd' member and unify code to use ->root_key.objectid in all places. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: Use wrapper macro for rcu string to remove duplicate codeMisono Tomohiro
Cleanup patch and no functional changes. Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06Btrfs: fix btrfs_write_inode vs delayed iput deadlockJosef Bacik
We recently ran into the following deadlock involving btrfs_write_inode(): [ +0.005066] __schedule+0x38e/0x8c0 [ +0.007144] schedule+0x36/0x80 [ +0.006447] bit_wait+0x11/0x60 [ +0.006446] __wait_on_bit+0xbe/0x110 [ +0.007487] ? bit_wait_io+0x60/0x60 [ +0.007319] __inode_wait_for_writeback+0x96/0xc0 [ +0.009568] ? autoremove_wake_function+0x40/0x40 [ +0.009565] inode_wait_for_writeback+0x21/0x30 [ +0.009224] evict+0xb0/0x190 [ +0.006099] iput+0x1a8/0x210 [ +0.006103] btrfs_run_delayed_iputs+0x73/0xc0 [ +0.009047] btrfs_commit_transaction+0x799/0x8c0 [ +0.009567] btrfs_write_inode+0x81/0xb0 [ +0.008008] __writeback_single_inode+0x267/0x320 [ +0.009569] writeback_sb_inodes+0x25b/0x4e0 [ +0.008702] wb_writeback+0x102/0x2d0 [ +0.007487] wb_workfn+0xa4/0x310 [ +0.006794] ? wb_workfn+0xa4/0x310 [ +0.007143] process_one_work+0x150/0x410 [ +0.008179] worker_thread+0x6d/0x520 [ +0.007490] kthread+0x12c/0x160 [ +0.006620] ? put_pwq_unlocked+0x80/0x80 [ +0.008185] ? kthread_park+0xa0/0xa0 [ +0.007484] ? do_syscall_64+0x53/0x150 [ +0.007837] ret_from_fork+0x29/0x40 Writeback calls: btrfs_write_inode btrfs_commit_transaction btrfs_run_delayed_iputs If iput() is called on that same inode, evict() will wait for writeback forever. btrfs_write_inode() was originally added way back in 4730a4bc5bf3 ("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode() hasn't been used for O_SYNC since 148f948ba877 ("vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so btrfs_write_inode() is actually unnecessary (and leads to a bunch of unnecessary commits). Get rid of it, which also gets rid of the deadlock. CC: stable@vger.kernel.org # 3.2+ Signed-off-by: Josef Bacik <jbacik@fb.com> [Omar: new commit message] Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: refactor block group replication factor calculation to a helperDavid Sterba
There are many places that open code the duplicity factor of the block group profiles, create a common helper. This can be easily extended for more copies. Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: rename btrfs_parse_early_optionsAnand Jain
Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As btrfs_parse_early_options() parses the -o device options and scan the device provided. So this rename specifies its action. Also the function name is in line with btrfs_parse_subvol_options(). No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: return device pointer from btrfs_scan_one_deviceGu Jinxiang
Return device pointer (with the IS_ERR semantics) from btrfs_scan_one_device so we don't have to return in through pointer. And since btrfs_fs_devices can be obtained from btrfs_device, return that. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ fixed conflics after recent changes to btrfs_scan_one_device ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: make fs_devices a local variable in btrfs_parse_early_optionsGu Jinxiang
fs_devices is always passed to btrfs_scan_one_device which overrides it. In the call stack below fs_devices is passed to btrfs_scan_one_device from btrfs_mount_root. In btrfs_mount_root the output fs_devices of this call stack is not used. btrfs_mount_root btrfs_parse_early_options btrfs_scan_one_device So, it is not necessary to pass fs_devices from btrfs_mount_root, using a local variable in btrfs_parse_early_options is enough. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: fix mount and ioctl device scan ioctl raceDavid Sterba
Technically this extends the critical section covered by uuid_mutex to: - parse early mount options -- here we can call device scan on paths that can be passed as 'device=/dev/...' - scan the device passed to mount - open the devices related to the fs_devices -- this increases fs_devices::opened The race can happen when mount calls one of the scans and there's another one called eg. by mkfs or 'btrfs dev scan': Mount Scan ----- ---- scan_one_device (dev1, fsid1) scan_one_device (dev2, fsid1) add the device free stale devices fsid1 fs_devices::opened == 0 find fsid1:dev1 free fsid1:dev1 if it's the last one, free fs_devices of fsid1 too open_devices (dev1, fsid1) dev1 not found When fixed, the uuid mutex will make sure that mount will increase fs_devices::opened and this will not be touched by the racing scan ioctl. Reported-and-tested-by: syzbot+909a5177749d7990ffa4@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+ceb2606025ec1cc3479c@syzkaller.appspotmail.com Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: reorder initialization before the mount locks uuid_mutexDavid Sterba
In preparation to take a big lock, move resource initialization before the critical section. It's not obvious from the diff, the desired order is: - initialize mount security options - allocate temporary fs_info - allocate superblock buffers Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: lift uuid_mutex to callers of btrfs_parse_early_optionsDavid Sterba
Prepartory work to fix race between mount and device scan. btrfs_parse_early_options calls the device scan from mount and we'll need to let mount completely manage the critical section. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: lift uuid_mutex to callers of btrfs_open_devicesDavid Sterba
Prepartory work to fix race between mount and device scan. The callers will have to manage the critical section, eg. mount wants to scan and then call btrfs_open_devices without the ioctl scan walking in and modifying the fs devices in the meantime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: lift uuid_mutex to callers of btrfs_scan_one_deviceDavid Sterba
Prepartory work to fix race between mount and device scan. The callers will have to manage the critical section, eg. mount wants to scan and then call btrfs_open_devices without the ioctl scan walking in and modifying the fs devices in the meantime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: remove unused parameter from btrfs_parse_subvol_optionsGu Jinxiang
Since parameter flags is no more used since commit d7407606564c ("btrfs: split parse_early_options() in two"), remove it. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: prune unused includesDavid Sterba
Remove includes if none of the interfaces and exports is used in the given source file. Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: Fix a C compliance issueBart Van Assche
The C programming language does not allow to use preprocessor statements inside macro arguments (pr_info() is defined as a macro). Hence rework the pr_info() statement in btrfs_print_mod_info() such that it becomes compliant. This patch allows tools like sparse to analyze the BTRFS source code. Fixes: 62e855771dac ("btrfs: convert printk(KERN_* to use pr_* calls") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06btrfs: Annotate fall-through when parsing mount optionBart Van Assche
This patch avoids that the compiler complains that a fall-through annotation is missing when building with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28Btrfs: allow empty subvol= againOmar Sandoval
I got a report that after upgrading to 4.16, someone's filesystems weren't mounting: [ 23.845852] BTRFS info (device loop0): unrecognized mount option 'subvol=' Before 4.16, this mounted the default subvolume. It turns out that this empty "subvol=" is actually an application bug, but it was causing the application to fail, so it's an ABI break if you squint. The generic parsing code we use for mount options (match_token()) doesn't match an empty string as "%s". Previously, setup_root_args() removed the "subvol=" string, but the mount path was cleaned up to not need that. Add a dummy Opt_subvol_empty to fix this. The simple workaround is to use / or . for the value of 'subvol=' . Fixes: 312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()") CC: stable@vger.kernel.org # 4.16+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28btrfs: return original error code when failing from option parsingChengguang Xu
It's not good to overwrite -ENOMEM using -EINVAL when failing from mount option parsing, so just return original error code. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12btrfs: replace GPL boilerplate by SPDX -- sourcesDavid Sterba
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: use RCU in btrfs_show_devname for device list traversalDavid Sterba
The show_devname callback is used to print device name in /proc/self/mounts, we need to traverse the device list consistently and read the name that's copied to a seq buffer so we don't need further locking. If the first device is being deleted at the same time, the RCU will allow us to read the device name, though it will become stale right after the RCU protection ends. This is unavoidable and the user can expect that the device will disappear from the filesystem's list at some point. The device_list_mutex was pretty heavy as it is used eg. for writing superblock and a few other IO related contexts. This can stall any application that reads the proc file for no reason. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: sort and group mount option definitionsDavid Sterba
Sort mount options by the primary name, followed by the 'no-' counterpart if it exists. Group the deprecated and debugging options. Enum and token defintions are synced. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31btrfs: Add nossd_spread mount optionHoward McLauchlan
Btrfs has two mount options for SSD optimizations: ssd and ssd_spread. Presently there is an option to disable all SSD optimizations, but there isn't an option to disable just ssd_spread. This patch adds a mount option nossd_spread that disables ssd_spread only. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: unify types for metadata_ratio and data_chunk_allocationsAnand Jain
We have btrfs_fs_info::data_chunk_allocations and btrfs_fs_info::metadata_ratio declared as unsigned which would be unsinged int and kernel style prefers unsigned int over bare unsigned. So this patch changes them to u32. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: add more __cold annotationsDavid Sterba
The __cold functions are placed to a special section, as they're expected to be called rarely. This could help i-cache prefetches or help compiler to decide which branches are more/less likely to be taken without any other annotations needed. Though we can't add more __exit annotations, it's still possible to add __cold (that's also added with __exit). That way the following function categories are tagged: - printf wrappers, error messages - exit helpers Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: verify subvolid mount parameterAnand Jain
We aren't verifying the parameter passed to the subvolid mount option, so we won't report and fail the mount if a junk value is specified for example, -o subvolid=abc. This patch verifies the subvolid option with match_u64. Up to now the memparse function accepts the K/M/G/ suffixes, that are usually meant for size values and do not make sense for a subvolume it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: Remove custom crc32c init codeNikolay Borisov
The custom crc32 init code was introduced in 14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to enable using btrfs as a built-in. However, later as pointed out by 60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this wasn't enough and finally btrfs was switched to late_initcall which comes after the generic crc32c implementation is initiliased. The latter commit superseeded the former. Now that we don't have to maintain our own code let's just remove it and switch to using the generic implementation. Despite touching a lot of files the patch is really simple. Here is the gist of the changes: 1. Select LIBCRC32C rather than the low-level modules. 2. s/btrfs_crc32c/crc32c/g 3. replace hash.h with linux/crc32c.h 4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly. I've tested this with btrfs being both a module and a built-in and xfstest doesn't complain. Does seem to fix the longstanding problem of not automatically selectiong the crc32c module when btrfs is used. Possibly there is a workaround in dracut. The modinfo confirms that now all the module dependencies are there: before: depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate after: depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add more info to changelog from mails ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: add a comment to mark the deprecated mount optionAnand Jain
The options alloc_start and subvolrootid are deprecated, comment them in the tokens list. And leave them as it is. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: manage commit mount option as %uAnand Jain
As the commit mount option is unsigned so manage it as %u for token verifications, instead of %d. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: manage check_int_print_mask mount option as %uAnand Jain
As check_int_print_mask mount option is unsigned so manage it as %u for token verifications, instead of %d. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: manage metadata_ratio mount option as %uAnand Jain
As metadata_ratio mount option is unsinged so manage it as %u for token verifications, instead of %d. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>