path: root/fs/btrfs/super.c
Age | Commit message | Author
2015-09-10 | Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock | Filipe Manana
After commit e44163e17796 ("btrfs: explictly delete unused block groups in close_ctree and ro-remount"), added in the 4.3 merge window, we have calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex. This can cause a deadlock with a concurrent block group relocation (when a filesystem balance or shrink operation is in progress, for example), because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex while the relocation path first locks delete_unused_bgs_mutex and then locks cleaner_mutex, resulting in a classic ABBA deadlock:

    CPU 0                                   CPU 1

    lock fs_info->cleaner_mutex
                                            __btrfs_balance() || btrfs_shrink_device()
                                              lock fs_info->delete_unused_bgs_mutex
                                              btrfs_relocate_chunk()
                                                btrfs_relocate_block_group()
                                                  lock fs_info->cleaner_mutex
    btrfs_delete_unused_bgs()
      lock fs_info->delete_unused_bgs_mutex

Fix this by not taking the cleaner_mutex before calling btrfs_delete_unused_bgs(), because it is no longer needed after commit 67c5e7d464bc ("Btrfs: fix race between balance and unused block group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the spinlock fs_info->unused_bgs_lock and a block group's spinlock are enough to get correct serialization between tasks running relocation and unused block group deletion (as well as between multiple tasks concurrently calling btrfs_delete_unused_bgs()).

This issue was discussed (on the mailing list) during the review of the patch titled "btrfs: explictly delete unused block groups in close_ctree and ro-remount", and it was agreed that acquiring the cleaner mutex had to be dropped after the patch titled "Btrfs: fix race between balance and unused block group deletion" got merged (both patches were submitted at about the same time, but one landed in kernel 4.2 and the other in the 4.3 merge window).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
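The lock-order inversion described in this commit can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the lock IDs, `struct lock_graph`, `task_lock()` and `paths_are_consistent()` are hypothetical names invented for the example, and the checker only detects direct two-lock inversions (it is not the kernel's lockdep and takes no real locks).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock IDs; they only mirror the names in the commit message. */
enum lock_id { CLEANER_MUTEX, DELETE_UNUSED_BGS_MUTEX, NR_LOCKS };

/* order[a][b] is true once some task has taken lock b while holding lock a. */
struct lock_graph {
    bool order[NR_LOCKS][NR_LOCKS];
};

/* A simulated task: the stack of locks it currently holds. */
struct task {
    enum lock_id held[NR_LOCKS];
    int depth;
};

/*
 * Record that the task takes `next` while holding its current locks.
 * Returns false if the opposite order was already recorded (ABBA).
 */
bool task_lock(struct lock_graph *g, struct task *t, enum lock_id next)
{
    bool ok = true;

    assert(t->depth < NR_LOCKS);
    for (int i = 0; i < t->depth; i++) {
        if (g->order[next][t->held[i]])
            ok = false;                     /* reverse edge exists */
        g->order[t->held[i]][next] = true;
    }
    t->held[t->depth++] = next;
    return ok;
}

/*
 * Replays the two paths from the commit message. The flag selects the
 * behaviour before the fix (true) or after the fix (false).
 */
bool paths_are_consistent(bool unmount_holds_cleaner_mutex)
{
    struct lock_graph g;
    struct task relocation = { .depth = 0 };
    struct task unmount = { .depth = 0 };
    bool ok = true;

    memset(&g, 0, sizeof(g));

    /* Relocation path: delete_unused_bgs_mutex first, then cleaner_mutex. */
    ok = task_lock(&g, &relocation, DELETE_UNUSED_BGS_MUTEX) && ok;
    ok = task_lock(&g, &relocation, CLEANER_MUTEX) && ok;

    /* Path that ends up calling btrfs_delete_unused_bgs(). */
    if (unmount_holds_cleaner_mutex)
        ok = task_lock(&g, &unmount, CLEANER_MUTEX) && ok;
    ok = task_lock(&g, &unmount, DELETE_UNUSED_BGS_MUTEX) && ok;

    return ok;
}
```

In this model, `paths_are_consistent(true)` reports the inversion, while `paths_are_consistent(false)` (the behaviour after the fix) records only one ordering between the two mutexes.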
2015-08-09 | Merge branch 'jeffm-discard-4.3' into for-linus-4.3 | Chris Mason
2015-08-09 | Btrfs: add support for blkio controllers | Chris Mason
This attaches accounting information to bios as we submit them so the new blkio controllers can throttle on btrfs filesystems. Not much is required, we're just associating bios with blkcgs during clone, calling wbc_init_bio()/wbc_account_io() during writepages submission, and attaching the bios to the current context during direct IO. Finally if we are splitting bios during btrfs_map_bio, this attaches accounting information to the split. The end result is able to throttle nicely on single disk filesystems. A little more work is required for multi-device filesystems. Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 | btrfs: add missing discards when unpinning extents with -o discard | Jeff Mahoney
When we clear the dirty bits in btrfs_delete_unused_bgs for extents in the empty block group, it results in btrfs_finish_extent_commit being unable to discard the freed extents. The block group removal patch added an alternate path to forget extents other than btrfs_finish_extent_commit. As a result, any extents that would be freed when the block group is removed aren't discarded. In my test run, with a large copy of mixed sized files followed by removal, it left nearly 2/3 of extents undiscarded. To clean up the block groups, we add the removed block group onto a list that will be discarded after transaction commit. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 | btrfs: explictly delete unused block groups in close_ctree and ro-remount | Jeff Mahoney
The cleaner thread may already be sleeping by the time we enter close_ctree. If that's the case, we'll skip removing any unused block groups queued for removal, even during a normal umount. They'll be cleaned up automatically at next mount, but users expect a umount to be a clean synchronization point, especially when used on thin-provisioned storage with -odiscard. We also explicitly remove unused block groups in the ro-remount path for the same reason. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 | Btrfs: show subvol= and subvolid= in /proc/mounts | Omar Sandoval
Now that we're guaranteed to have a meaningful root dentry, we can just export seq_dentry() and use it in btrfs_show_options(). The subvolume ID is easy to get and can also be useful, so put that in there, too. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 | Btrfs: unify subvol= and subvolid= mounting | Omar Sandoval
Currently, mounting a subvolume with subvolid= takes a different code path than mounting with subvol=. This isn't really a big deal except for the fact that mounts done with subvolid= or the default subvolume don't have a dentry that's connected to the dentry tree like in the subvol= case. To unify the code paths, when given subvolid= or using the default subvolume ID, translate it into a subvolume name by walking ROOT_BACKREFs in the root tree and INODE_REFs in the filesystem trees. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 | Btrfs: fail on mismatched subvol and subvolid mount options | Omar Sandoval
There's nothing to stop a user from passing both subvol= and subvolid= to mount, but if they don't refer to the same subvolume, someone is going to be surprised at some point. Error out on this case, but allow users to pass in both if they do match (which they could, for example, get out of /proc/mounts). Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 | Btrfs: clean up error handling in mount_subvol() | Omar Sandoval
In preparation for new functionality in mount_subvol(), give it ownership of subvol_name and tidy up the error paths. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 | Btrfs: remove all subvol options before mounting top-level | Omar Sandoval
Currently, setup_root_args() substitutes 's/subvol=[^,]*/subvolid=0/'. But this means that if the user passes both a subvol and subvolid for some reason, we won't actually mount the top-level when we recursively mount. For example, consider:

    mkfs.btrfs -f /dev/sdb
    mount /dev/sdb /mnt
    btrfs subvol create /mnt/subvol1 # subvolid=257
    btrfs subvol create /mnt/subvol2 # subvolid=258
    umount /mnt
    mount -osubvol=/subvol1,subvolid=258 /dev/sdb /mnt

In the final mount, subvol=/subvol1,subvolid=258 becomes subvolid=0,subvolid=258, and the last option takes precedence, so we mount subvol2 and try to look up subvol1 inside of it, which fails.

So, instead, do a thorough scan through the argument list and remove any subvol= and subvolid= options, then append subvolid=0 to the end. This implicitly makes subvol= take precedence over subvolid=, but we're about to add a stricter check for that. This also makes setup_root_args() more generic, which we'll need soon.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
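The "scan, drop, append" approach described above can be sketched in user-space C. This is an illustration only: `strip_subvol_opts()` is a hypothetical helper written for this example and is not the kernel's setup_root_args(); it assumes a plain comma-separated option string without quoting or escaping.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns a newly allocated copy of `args` with every "subvol=" and
 * "subvolid=" option removed and "subvolid=0" appended at the end.
 * The caller frees the result. Returns NULL if allocation fails.
 */
char *strip_subvol_opts(const char *args)
{
    static const char suffix[] = "subvolid=0";
    char *out, *dst;
    const char *p = args;

    assert(args != NULL);

    /* Worst case: all input kept, plus one comma, the suffix and its NUL. */
    out = malloc(strlen(args) + 1 + sizeof(suffix));
    if (!out)
        return NULL;
    dst = out;

    while (*p) {
        size_t len = strcspn(p, ",");   /* length of the current option */

        if (len > 0 &&
            strncmp(p, "subvol=", 7) != 0 &&
            strncmp(p, "subvolid=", 9) != 0) {
            if (dst != out)
                *dst++ = ',';
            memcpy(dst, p, len);
            dst += len;
        }
        p += len;
        if (*p == ',')
            p++;                        /* skip the separator */
    }

    if (dst != out)
        *dst++ = ',';
    memcpy(dst, suffix, sizeof(suffix)); /* also copies the trailing NUL */
    return out;
}
```

With the option string from the example above, `subvol=/subvol1,subvolid=258`, this sketch yields `subvolid=0`, so the recursive mount sees only the top-level subvolume ID.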
2015-06-03 | Btrfs: lock superblock before remounting for rw subvol | Omar Sandoval
Since commit 0723a0473fb4 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options"), when mounting a subvolume read/write when another subvolume has previously been mounted read-only, we first do a remount. However, this should be done with the superblock locked, as per sync_filesystem():

    /*
     * We need to be protected against the filesystem going from
     * r/o to r/w or vice versa.
     */
    WARN_ON(!rwsem_is_locked(&sb->s_umount));

This WARN_ON can easily be hit with:

    mkfs.btrfs -f /dev/vdb
    mount /dev/vdb /mnt
    btrfs subvol create /mnt/vol1
    btrfs subvol create /mnt/vol2
    umount /mnt
    mount -oro,subvol=/vol1 /dev/vdb /mnt
    mount -orw,subvol=/vol2 /dev/vdb /mnt2

Fixes: 0723a0473fb4 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options")
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 | btrfs: add 'cold' compiler annotations to all error handling functions | David Sterba
The annotated functions will be placed into the .text.unlikely section. The annotation also hints to the compiler to move the code out of the hot paths, and may implicitly mark the if-statement leading to that block as unlikely. This is a heuristic; the impact on the generated code is not significant.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
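As a rough illustration of the annotation, here is a user-space sketch using the GCC/Clang `cold` function attribute. The macro `my_cold` and the functions `report_error()` and `checked_div()` are hypothetical names made up for this example; where the compiler actually places the code depends on the toolchain and optimization level.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a kernel-style cold annotation. */
#define my_cold __attribute__((cold))

static int errors_reported;

/* Error handler: marked cold, so the compiler may keep it off the hot path. */
my_cold void report_error(const char *what, int err)
{
    errors_reported++;
    fprintf(stderr, "error %d: %s\n", err, what);
}

/* The branch that calls a cold function may be treated as unlikely. */
int checked_div(int a, int b, int *result)
{
    assert(result != NULL);
    if (b == 0) {
        report_error("division by zero", -1);
        return -1;
    }
    *result = a / b;
    return 0;
}
```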
2015-06-02 | btrfs: report exact callsite where transaction abort occurs | David Sterba
WARN is called from a single location and all bug reports say that's in super.c __btrfs_abort_transaction. This is slightly confusing, as we'd rather want to know the exact callsite. Although this information is printed in the syslog below the stacktrace, finding it requires a further look, and we usually see only the headline from WARNING.

Moving the WARN into the macro has to inline some code and increases code size by a few kilobytes:

    text    data   bss    dec     hex    filename
    835481  20305  14120  869906  d4612  btrfs.ko.before
    842883  20305  14120  877308  d62fc  btrfs.ko.after

The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config. The increase is not small and could lead to worse icache use. The code is on error/exit paths that can be recognized by the compiler as cold and moved out of the way, so the impact is speculated to be low, if measurable at all.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
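The reason a macro reports the caller's location while a plain function does not can be shown with a small user-space sketch. All names here (`abort_transaction_at()`, `abort_transaction()`, `commit_something()`, `last_abort`) are hypothetical and chosen for illustration; the real kernel macro and WARN machinery differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct callsite {
    const char *func;
    int line;
};

static struct callsite last_abort;

/* Out-of-line helper: it only knows the location it is told about. */
void abort_transaction_at(const char *func, int line, int err)
{
    assert(func != NULL);
    last_abort.func = func;
    last_abort.line = line;
    fprintf(stderr, "transaction aborted (error %d) at %s:%d\n",
            err, func, line);
}

/*
 * The macro expands inside the caller, so __func__ and __LINE__ refer to
 * the caller rather than to the helper above.
 */
#define abort_transaction(err) \
    abort_transaction_at(__func__, __LINE__, (err))

int commit_something(void)
{
    abort_transaction(-28);     /* -28 corresponds to -ENOSPC on Linux */
    return -28;
}
```

The cost noted in the commit message follows from the same mechanism: whatever the macro body contains is duplicated at every call site.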
2015-04-26 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-15 | VFS: normal filesystems (and lustre): d_inode() annotations | David Howells
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-26 | btrfs: cleanup orphans while looking up default subvolume | Jeff Mahoney
Orphans in the fs tree are cleaned up via open_ctree and subvolume orphans are cleaned via btrfs_lookup_dentry -- except when a default subvolume is in use. The name for the default subvolume uses a manual lookup that doesn't trigger orphan cleanup and needs to trigger it manually as well. This doesn't apply to the remount case since the subvolumes are cleaned up by walking the root radix tree. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 | btrfs: explicitly set control file's private_data | Tom Van Braeckel
The private_data member of the Btrfs control device file (/dev/btrfs-control) is used to hold the current transaction and needs to be initialized to NULL to signify that no transaction is in progress.

We explicitly set the control file's private_data to NULL to be independent of whatever value the misc subsystem initializes it to.

Backstory:
----------
The misc subsystem (which is used by /dev/btrfs-control) initializes a file's private_data to point to the misc device when a driver has registered a custom open file operation and initializes it to NULL when a custom open file operation has *not* been provided.

This subtle quirk is confusing, to the point where kernel code registers *empty* file open operations to have private_data point to the misc device structure. And it leads to bugs, where the addition or removal of a custom open file operation surprisingly changes the initial contents of a file's private_data structure.

To simplify things in the misc subsystem, a patch [1] has been proposed to *always* set private_data to point to the misc device instead of only doing this when a custom open file operation has been registered. But before we can fix this in the misc subsystem itself, we need to modify the (few) drivers that rely on this very subtle behavior.

[1] https://lkml.org/lkml/2014/12/4/939

Signed-off-by: Martin Kepplinger <martink@posteo.de>
Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-25 | Merge branch 'cleanups-for-4.1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 | Chris Mason
2015-03-03 | btrfs: cleanup 64bit/32bit divs, compile time constants | David Sterba
Switch to div_u64 if the divisor is a numeric constant or a sum of sizeof()s. We can remove a few instances of do_div, which has the hidden semantics of changing its first argument. Small power-of-two divisors are converted to bitshifts; large values are kept intact for clarity.

Signed-off-by: David Sterba <dsterba@suse.cz>
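The difference in calling convention can be sketched in user space. The helpers `my_do_div()` and `my_div_u64()` below are hypothetical stand-ins written for this example (the kernel's do_div() is a macro taking an lvalue, modelled here with a pointer); the divisor 3 and the 4 KiB shift are arbitrary illustration values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for do_div(): divides *n in place and returns the remainder.
 * The in-place update is the "hidden" side effect mentioned above.
 */
static inline uint32_t my_do_div(uint64_t *n, uint32_t base)
{
    uint32_t rem;

    assert(base != 0);
    rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}

/* Stand-in for div_u64(): no side effects, returns the quotient. */
static inline uint64_t my_div_u64(uint64_t dividend, uint32_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}

/* do_div style: a temporary is needed because the argument is clobbered. */
uint64_t third_with_do_div(uint64_t bytes)
{
    uint64_t tmp = bytes;

    (void)my_do_div(&tmp, 3);
    return tmp;
}

/* div_u64 style: reads like ordinary division. */
uint64_t third_with_div_u64(uint64_t bytes)
{
    return my_div_u64(bytes, 3);
}

/* Small power-of-two divisor written as a shift: bytes / 4096. */
uint64_t bytes_to_4k_units(uint64_t bytes)
{
    return bytes >> 12;
}
```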
2015-02-19 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds
Pull btrfs updates from Chris Mason:
 "This pull is mostly cleanups and fixes:

   - The raid5/6 cleanups from Zhao Lei fixup some long standing warts in the code and add improvements on top of the scrubbing support from 3.19.

   - Josef has round one of our ENOSPC fixes coming from large btrfs clusters here at FB.

   - Dave Sterba continues a long series of cleanups (thanks Dave), and Filipe continues hammering on corner cases in fsync and others

  This all was held up a little trying to track down a use-after-free in btrfs raid5/6. It's not clear yet if this is just made easier to trigger with this pull or if it's a new bug from the raid5/6 cleanups. Dave Sterba is the only one to trigger it so far, but he has a consistent way to reproduce, so we'll get it nailed shortly"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
  Btrfs: don't remove extents and xattrs when logging new names
  Btrfs: fix fsync data loss after adding hard link to inode
  Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
  Btrfs: account for large extents with enospc
  Btrfs: don't set and clear delalloc for O_DIRECT writes
  Btrfs: only adjust outstanding_extents when we do a short write
  btrfs: Fix out-of-space bug
  Btrfs: scrub, fix sleep in atomic context
  Btrfs: fix scheduler warning when syncing log
  Btrfs: Remove unnecessary placeholder in btrfs_err_code
  btrfs: cleanup init for list in free-space-cache
  btrfs: delete chunk allocation attemp when setting block group ro
  btrfs: clear bio reference after submit_one_bio()
  Btrfs: fix scrub race leading to use-after-free
  Btrfs: add missing cleanup on sysfs init failure
  Btrfs: fix race between transaction commit and empty block group removal
  btrfs: add more checks to btrfs_read_sys_array
  btrfs: cleanup, rename a few variables in btrfs_read_sys_array
  btrfs: add checks for sys_chunk_array sizes
  btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
  ...
2015-01-21 | btrfs: remove a no-op unfreeze superbock callback | David Sterba
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-20 | btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock. | Qu Wenruo
Commit 6b5fe46dfa52 (btrfs: do commit in sync_fs if there are pending changes) will call btrfs_start_transaction() in sync_fs(), to handle some operations that need to be done in the next transaction. However, this can cause a deadlock if the filesystem is frozen, with the following sysrq+w output:

    [ 143.255932] Call Trace:
    [ 143.255936] [<ffffffff816c0e09>] schedule+0x29/0x70
    [ 143.255939] [<ffffffff811cb7f3>] __sb_start_write+0xb3/0x100
    [ 143.255971] [<ffffffffa040ec06>] start_transaction+0x2e6/0x5a0 [btrfs]
    [ 143.255992] [<ffffffffa040f1eb>] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [ 143.256003] [<ffffffffa03dc0ba>] btrfs_sync_fs+0xca/0xd0 [btrfs]
    [ 143.256007] [<ffffffff811f7be0>] sync_fs_one_sb+0x20/0x30
    [ 143.256011] [<ffffffff811cbd01>] iterate_supers+0xe1/0xf0
    [ 143.256014] [<ffffffff811f7d75>] sys_sync+0x55/0x90
    [ 143.256017] [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17

    [ 143.256111] Call Trace:
    [ 143.256114] [<ffffffff816c0e09>] schedule+0x29/0x70
    [ 143.256119] [<ffffffff816c3405>] rwsem_down_write_failed+0x1c5/0x2d0
    [ 143.256123] [<ffffffff8133f013>] call_rwsem_down_write_failed+0x13/0x20
    [ 143.256131] [<ffffffff811caae8>] thaw_super+0x28/0xc0
    [ 143.256135] [<ffffffff811db3e5>] do_vfs_ioctl+0x3f5/0x540
    [ 143.256187] [<ffffffff811db5c1>] SyS_ioctl+0x91/0xb0
    [ 143.256213] [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17

The reason is as follows:

    (Holding s_umount)
    VFS sync_fs stuff:
    |- btrfs_sync_fs()
       |- btrfs_start_transaction()
          |- sb_start_intwrite()
          (Waiting thaw_fs to unfreeze)
                                        VFS thaw_fs stuff:
                                        thaw_fs()
                                        (Waiting sync_fs to release
                                         s_umount)

So a deadlock happens. This can be easily triggered by fstest/generic/068 with the inode_cache mount option.

The fix is to check whether the fs is frozen; if it is, just return and wait for the next transaction.

Cc: David Sterba <dsterba@suse.cz>
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[enhanced comment, changed to SB_FREEZE_WRITE]
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 | btrfs: sync ioctl, handle errors after transaction start | David Sterba
The version merged to 3.19 did not handle errors from start_transaction and could pass an invalid pointer to commit_transaction.

Fixes: 6b5fe46dfa52441f ("btrfs: do commit in sync_fs if there are pending changes")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
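The class of bug fixed here, using a returned handle without checking whether it encodes an error, can be sketched in user space. Everything below is an assumption made for illustration: `my_err_ptr()`, `my_is_err()` and `my_ptr_err()` imitate the kernel's error-pointer convention, and the `*_stub()` functions and `sync_fs_sketch()` are hypothetical, not the btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, kernel style. */
static inline void *my_err_ptr(long err) { return (void *)err; }
static inline long my_ptr_err(const void *p) { return (long)p; }

/* Pointers in the top MY_MAX_ERRNO addresses are treated as errors. */
static inline int my_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MY_MAX_ERRNO;
}

struct trans_handle { int id; };

static struct trans_handle the_handle = { 1 };
static int commits;

/* Hypothetical stub: either returns a handle or an encoded error. */
struct trans_handle *start_transaction_stub(int fail)
{
    if (fail)
        return my_err_ptr(-ENOMEM);
    return &the_handle;
}

/* Hypothetical stub: must never be handed an error pointer. */
int commit_transaction_stub(struct trans_handle *trans)
{
    assert(!my_is_err(trans));
    commits++;
    return 0;
}

/* The check between start and commit is the step the fix adds. */
int sync_fs_sketch(int fail)
{
    struct trans_handle *trans = start_transaction_stub(fail);

    if (my_is_err(trans))
        return (int)my_ptr_err(trans);
    return commit_transaction_stub(trans);
}
```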
2014-12-02 | Btrfs: make btrfs_abort_transaction consider existence of new block groups | Filipe Manana
If the transaction handle doesn't have used blocks but has created new block groups, make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree:

    [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
    [25230.404301] BTRFS: Transaction aborted (error -28)
    [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
    [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
    [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [25230.404328] 0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
    [25230.404330] ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
    [25230.404332] ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
    [25230.404334] Call Trace:
    [25230.404338] [<ffffffff813e7a13>] dump_stack+0x45/0x56
    [25230.404342] [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
    [25230.404351] [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
    [25230.404353] [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
    [25230.404362] [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
    [25230.404374] [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
    [25230.404387] [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
    [25230.404398] [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
    [25230.404408] [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
    [25230.404421] [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
    [25230.404425] [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
    [25230.404429] [<ffffffff810f6501>] ? get_page+0x1a/0x2b
    [25230.404441] [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
    [25230.404443] [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
    [25230.404446] [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
    [25230.404449] [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
    [25230.404450] [<ffffffff81139401>] vfs_write+0xb0/0x112
    [25230.404452] [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
    [25230.404454] [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
    [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
    [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
    [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 | Merge branch 'dev/pending-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus | Chris Mason
2014-11-20 | btrfs: fix wrong accounting of raid1 data profile in statfs | David Sterba
The sizes that are obtained from space infos are in raw units and have to be adjusted according to the raid factor. This was missing for f_bavail and df reported doubled size for raid1. Reported-by: Martin Steigerwald <Martin@lichtvoll.de> Fixes: ba7b6e62f420 ("btrfs: adjust statfs calculations according to raid profiles") CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
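The adjustment described here amounts to dividing a raw byte count by the number of copies the profile stores. The sketch below is an assumed model for illustration only: the `enum profile` values and the functions `raid_factor()` and `usable_bytes()` are hypothetical, and treating DUP, RAID1 and RAID10 as factor 2 is an assumption of this example rather than something stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical profile names for this sketch. */
enum profile {
    PROFILE_SINGLE,
    PROFILE_DUP,
    PROFILE_RAID0,
    PROFILE_RAID1,
    PROFILE_RAID10
};

/* Raw bytes consumed per byte of user data (assumed model). */
unsigned raid_factor(enum profile p)
{
    switch (p) {
    case PROFILE_DUP:
    case PROFILE_RAID1:
    case PROFILE_RAID10:
        return 2;       /* every byte is stored twice */
    default:
        return 1;
    }
}

/* Convert a raw size into what a user could actually store. */
uint64_t usable_bytes(uint64_t raw_bytes, enum profile p)
{
    unsigned factor = raid_factor(p);

    assert(factor != 0);
    return raw_bytes / factor;
}
```

Skipping the division for a two-copy profile is what makes df report double the real size, as in the bug report above.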
2014-11-20 | Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2 | Josef Bacik
Our gluster boxes get several thousand statfs() calls per second, which begins to suck hardcore with all of the lock contention on the chunk mutex and dev list mutex. We don't really need to hold these things, if we have transient weirdness with statfs() because of the chunk allocator we don't care, so remove this locking. We still need the dev_list lock if you mount with -o alloc_start however, which is a good argument for nuking that thing from orbit, but that's a patch for another day. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-11-12 | btrfs: switch inode_cache option handling to pending changes | David Sterba
The pending mount option(s) now share namespace and bits with the normal options, and the existing one for (inode_cache) is unset unconditionally at each transaction commit. Introduce a separate namespace for pending changes and enhance the descriptions of the intended change to use separate bits for each action. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-12 | btrfs: do commit in sync_fs if there are pending changes | David Sterba
If a pending change is requested, it's not processed unless there is a transaction commit about to happen, not even after sync or SYNC_FS ioctl. For example a remount that toggles the inode_cache option will not take effect after sync on a quiescent filesystem. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-27 | Btrfs: properly clean up btrfs_end_io_wq_cache | Josef Bacik
In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on unload, which makes us unable to unload and then re-load the btrfs module. This fixes the problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-08 | btrfs: Fix compile error when CONFIG_SECURITY is not set. | Qu Wenruo
Fix the following compile error when CONFIG_SECURITY is not set: error: 'struct security_mnt_opts' has no member named 'num_mnt_opts' Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-10-06 | btrfs: Make btrfs handle security mount options internally to avoid losing security label. | Qu Wenruo
[BUG]
Originally, when mounting btrfs with the "-o subvol=" mount option, btrfs would lose all security labels. And if the btrfs fs is mounted somewhere else, due to the loss of the security label, SELinux will refuse to mount, since the same super block is being mounted using a different security label.

[REPRODUCER]
With SELinux enabled:

    # mkfs -t btrfs /dev/sda5
    # mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs
    # btrfs subvolume create /mnt/btrfs/subvol
    # mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/test

kernel message:

    SELinux: mount invalid. Same superblock, different security settings for (dev sda5, type btrfs)

[REASON]
This happens because btrfs will call vfs_kern_mount() and then mount_subtree() to handle subvolume name lookup. The first mount cuts off all the security labels, so when it comes to the second vfs_kern_mount(), there is no security label left.

[FIX]
This patch makes btrfs behave much more like nfs, which has the type flag FS_BINARY_MOUNTDATA, so that btrfs handles the security label internally. The security label is then set at the real mount time and is not lost when used with the "subvol=" mount option.

Reported-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-02 | btrfs: use slab for end_io_wq structures | David Sterba
The structure is frequently reused. Rename it according to the slab name. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 | btrfs: fix error labels in init_btrfs_fs | David Sterba
btrfs_interface_init rarely fails but we could leak the prelim_ref slab. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 | Btrfs: set default max_inline to 8KiB instead of 8MiB | Filipe David Borba Manana
8MiB is way too large and likely set by mistake. This is not a significant issue as in practice the max amount of data added to an inline extent is also limited by the page cache and btree leaf sizes. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-01 | btrfs: remove unused variable from btrfs_parse_options | David Sterba
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-09-17 | Btrfs: fix unprotected device list access when getting the fs information | Miao Xie
When getting the fs information, we forgot to acquire the device list mutex, so we might access a device that has already been removed. Fix it by acquiring the device list mutex.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 | btrfs: add trace for qgroup accounting | Mark Fasheh
We want this to debug qgroup changes on live systems. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 | Btrfs: clear compress-force when remounting with compress option | Wang Shilong
Steps to reproduce:

    # mkfs.btrfs -f /dev/sdb
    # mount /dev/sdb /mnt -o compress-force=lzo
    # mount /dev/sdb /mnt -o remount,compress=zlib
    # cat /proc/mounts

Remounting from compress-force to compress does not clear the compress-force option. The problem is that there is no way for users to clear the compress-force option separately.

Fix this problem by clearing the @FORCE_COMPRESS flag when remounting with compress=xxx.

Suggested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
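The flag handling behind this fix can be sketched with plain bit operations. The bit names `OPT_COMPRESS` and `OPT_FORCE_COMPRESS` and the two helper functions are hypothetical, chosen only to mirror the options named in the commit message; they are not the kernel's mount option definitions.

```c
#include <assert.h>

/* Hypothetical option bits for this sketch. */
#define OPT_COMPRESS        (1u << 0)
#define OPT_FORCE_COMPRESS  (1u << 1)

/* compress-force=...: sets both the compress and the force bit. */
unsigned apply_compress_force(unsigned opts)
{
    return opts | OPT_COMPRESS | OPT_FORCE_COMPRESS;
}

/* compress=...: sets compress and, as in the fix, clears the force bit. */
unsigned apply_compress(unsigned opts)
{
    opts |= OPT_COMPRESS;
    opts &= ~OPT_FORCE_COMPRESS;
    assert(!(opts & OPT_FORCE_COMPRESS));   /* post-condition of the fix */
    return opts;
}
```

Without the clearing step in `apply_compress()`, a remount from compress-force to compress would leave the force bit set, which matches the behaviour in the reproducer.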
2014-09-17 | btrfs: make close_ctree return void | David Sterba
There's no user of the return value and we can get rid of the comment in put_super. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-08-16 | Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds
Pull btrfs updates from Chris Mason:
 "These are all fixes I'd like to get out to a broader audience.

  The biggest of the bunch is Mark's quota fix, which is also in the SUSE kernel, and makes our subvolume quotas dramatically more accurate.

  I've been running xfstests with these against your current git overnight, but I'm queueing up longer tests as well"

* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: disable strict file flushes for renames and truncates
  Btrfs: fix csum tree corruption, duplicate and outdated checksums
  Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
  Btrfs: fix compressed write corruption on enospc
  btrfs: correctly handle return from ulist_add
  btrfs: qgroup: account shared subtrees during snapshot delete
  Btrfs: read lock extent buffer while walking backrefs
  Btrfs: __btrfs_mod_ref should always use no_quota
  btrfs: adjust statfs calculations according to raid profiles
2014-08-15 | btrfs: adjust statfs calculations according to raid profiles | David Sterba
This has been discussed in thread:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528

and this patch implements this proposal:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536

Works fine for "clean" raid profiles where the raid factor correction does the right job. Otherwise it's pessimistic and may show low space although there's still some left.

The df numbers are slightly wrong in case of mixed block groups, but this is not a major use case and can be addressed later.

The RAID56 numbers are wrong almost the same way as before and will be addressed separately.

CC: Hugo Mills <hugo@carfax.org.uk>
CC: cwillu <cwillu@cwillu.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-07 | dcache: d_obtain_alias callers don't all want DISCONNECTED | J. Bruce Fields
There are a few d_obtain_alias callers that are using it to get the root of a filesystem which may already have an alias somewhere else. This is not the same as the filehandle-lookup case, and none of them actually need DCACHE_DISCONNECTED set. It isn't really a serious problem, but it would really be clearer if we reserved DCACHE_DISCONNECTED for those cases where it's actually needed. In the btrfs case this was causing a spurious printk from nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol", and this replaces that workaround. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-07-03btrfs: fix null pointer dereference in btrfs_show_devname when name is nullAnand Jain
dev->name is null but missing flag is not set. Strictly speaking the missing flag should have been set, but there are more places where code just checks if name is null. For now this patch does the same.

stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
IP: [<ffffffffa0228908>] btrfs_show_devname+0x58/0xf0 [btrfs]
[<ffffffff81198879>] show_vfsmnt+0x39/0x130
[<ffffffff81178056>] m_show+0x16/0x20
[<ffffffff8117d706>] seq_read+0x296/0x390
[<ffffffff8115aa7d>] vfs_read+0x9d/0x160
[<ffffffff8115b549>] SyS_read+0x49/0x90
[<ffffffff817abe52>] system_call_fastpath+0x16/0x1b

reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
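The defensive check described above can be sketched as follows. This is only an illustration of the idea (skip a device whose name pointer is null even when its missing flag is clear); `struct dev_entry` and `first_device_name` are hypothetical names and do not reproduce the kernel's `struct btrfs_device` or `btrfs_show_devname()`.

```c
#include <stddef.h>

/* Hypothetical stand-in for a device record; not the kernel's struct btrfs_device. */
struct dev_entry {
    int missing;        /* nonzero if the device is flagged as missing */
    const char *name;   /* may be NULL even when 'missing' is 0 */
};

/*
 * Return the first usable device name, skipping entries that are flagged
 * missing or that have no name. Returns NULL if no entry qualifies.
 */
static const char *first_device_name(const struct dev_entry *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i].missing)
            continue;
        if (devs[i].name == NULL)   /* the extra check: never dereference a null name */
            continue;
        return devs[i].name;
    }
    return NULL;
}
```

Checking the pointer directly, rather than relying on the flag alone, keeps the caller safe in the inconsistent state the reproducer creates.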
2014-07-03btrfs: fix nossd and ssd_spread mount option regressionEric Sandeen
Commit 0780253 ("btrfs: Cleanup the btrfs_parse_options for remount.") broke ssd options quite badly; it stopped making ssd_spread imply ssd, and it made "nossd" unsettable.

Put things back at least as well as they were before (though ssd mount option handling is still pretty odd: # mount -o "nossd,ssd_spread" works?)

Reported-by: Roman Mamedov <rm@romanrm.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
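The two behaviours the message restores (ssd_spread implies ssd, and nossd can be set) can be modelled with a small bitmask sketch. The flag names and helper functions are hypothetical and are not the kernel's `BTRFS_MOUNT_*` macros; having nossd clear the ssd bit is an assumption of this sketch rather than something the message states.

```c
/* Hypothetical option bits for this sketch; not the kernel's BTRFS_MOUNT_* values. */
enum {
    OPT_SSD        = 1 << 0,
    OPT_SSD_SPREAD = 1 << 1,
    OPT_NOSSD      = 1 << 2
};

/* "ssd": turn on the ssd allocation scheme. */
static unsigned apply_ssd(unsigned opts)
{
    return opts | OPT_SSD;
}

/* "ssd_spread": implies "ssd", so both bits are set. */
static unsigned apply_ssd_spread(unsigned opts)
{
    return opts | OPT_SSD_SPREAD | OPT_SSD;
}

/* "nossd": record the request and (assumed here) clear the ssd bit. */
static unsigned apply_nossd(unsigned opts)
{
    return (opts | OPT_NOSSD) & ~(unsigned)OPT_SSD;
}
```

The sketch does not try to resolve the `nossd,ssd_spread` combination that the message itself flags as odd.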
2014-07-03Btrfs: fix race between balance recovery and root deletionWang Shilong
Balance recovery runs on a RW mount or on a remount from RO to RW, in order to finish merging roots.

During balance recovery, the fs root corresponding to a relocation root (whose root refs is 0) might be destroyed by the cleaner thread, which makes btrfs fail to mount.

Fix this problem by holding @cleaner_mutex when doing balance recovery.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09btrfs: remove stale newlines from log messagesDavid Sterba
I've noticed an extra line after "use no compression", but a search revealed many more in messages of more critical levels and rare errors.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09Btrfs: remove OPT_acl parse when acl disabledGuangliang Zhao
Even if CONFIG_BTRFS_FS_POSIX_ACL is not defined, acl could still be enabled using a mount option. Since fs/btrfs/acl.o is not built in that case, the mount option would appear to be supported but would be silently ignored.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09Btrfs: add sanity tests for new qgroup accounting codeJosef Bacik
This exercises the various parts of the new qgroup accounting code. We do some basic stuff and do some things with the shared refs to make sure all that code works. I had to add a bunch of infrastructure because I needed to be able to insert items into a fake tree without having to do all the hard work myself; hopefully this will be useful in the future. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09Btrfs: reclaim the reserved metadata space at backgroundMiao Xie
Before this patch, a task had to reclaim metadata space by itself if there was not enough of it. Once a task started space reclamation, all other tasks that wanted to reserve metadata space were blocked. In some cases they were blocked for a long time, which made performance fluctuate wildly.

So we introduce background metadata space reclamation: when the space is about to be exhausted, we insert a reclaim work item into a workqueue, and the workqueue's worker reclaims the reserved space in the background. This way, tasks needn't reclaim the space by themselves in most cases, and even if they have to reclaim the space or are blocked waiting for reclamation, they will get enough space more quickly.

Here is my test result (tested by compilebench):
 Memory: 2GB
 CPU: 2Cores * 1CPU
 Partition: 40GB(SSD)

Test command:
 # compilebench -D <mnt> -m

Without this patch:
 intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
 compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
 read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
 delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)

With this patch:
 intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
 compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
 read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
 delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>