summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2013-07-19Btrfs: fix wrong write offset when replacing a deviceStefan Behrens
Miao Xie reported the following issue: The filesystem was corrupted after we did a device replace. Steps to reproduce: # mkfs.btrfs -f -m single -d raid10 <device0>..<device3> # mount <device0> <mnt> # btrfs replace start -rfB 1 <device4> <mnt> # umount <mnt> # btrfsck <device4> The reason for the issue is that we changed the write offset by mistake, introduced by commit 625f1c8dc. We read the data from the source device at first, and then write the data into the corresponding place of the new device. In order to implement the "-r" option, the source location is remapped using btrfs_map_block(). The read takes place on the mapped location, and the write needs to take place on the unmapped location. Currently the write is using the mapped location, and this commit changes it back by undoing the change to the write address that the aforementioned commit added by mistake. Reported-by: Miao Xie <miaox@cn.fujitsu.com> Cc: <stable@vger.kernel.org> # 3.10+ Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19Btrfs: re-add root to dead root list if we stop dropping itJosef Bacik
If we stop dropping a root for whatever reason we need to add it back to the dead root list so that we will re-start the dropping next transaction commit. The other case this happens is if we recover a drop because we will add a root without adding it to the fs radix tree, so we can leak it's root and commit root extent buffer, adding this to the dead root list makes this cleanup happen. Thanks, Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19Btrfs: fix lock leak when resuming snapshot deletionJosef Bacik
We aren't setting path->locks[level] when we resume a snapshot deletion which means we won't unlock the buffer when we free the path. This causes deadlocks if we happen to re-allocate the block before we've evicted the extent buffer from cache. Thanks, Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19Btrfs: update drop progress before stopping snapshot droppingJosef Bacik
Alex pointed out a problem and fix that exists in the drop one snapshot at a time patch. If we decide we need to exit for whatever reason (umount for example) we will just exit the snapshot dropping without updating the drop progress. So the next time we go to resume we will BUG_ON() because we can't find the extent we left off at because we never updated it. This patch fixes the problem. Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "These are the usual mixture of bugs, cleanups and performance fixes. Miao has some really nice tuning of our crc code as well as our transaction commits. Josef is peeling off more and more problems related to early enospc, and has a number of important bug fixes in here too" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits) Btrfs: wait ordered range before doing direct io Btrfs: only do the tree_mod_log_free_eb if this is our last ref Btrfs: hold the tree mod lock in __tree_mod_log_rewind Btrfs: make backref walking code handle skinny metadata Btrfs: fix crash regarding to ulist_add_merge Btrfs: fix several potential problems in copy_nocow_pages_for_inode Btrfs: cleanup the code of copy_nocow_pages_for_inode() Btrfs: fix oops when recovering the file data by scrub function Btrfs: make the chunk allocator completely tree lockless Btrfs: cleanup orphaned root orphan item Btrfs: fix wrong mirror number tuning Btrfs: cleanup redundant code in btrfs_submit_direct() Btrfs: remove btrfs_sector_sum structure Btrfs: check if we can nocow if we don't have data space Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc Btrfs: use a percpu to keep track of possibly pinned bytes Btrfs: check for actual acls rather than just xattrs when caching no acl Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate Btrfs: optimize reada_for_balance Btrfs: optimize read_block_for_search ...
2013-07-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) treewide: relase -> release Documentation/cgroups/memory.txt: fix stat file documentation sysctl/net.txt: delete reference to obsolete 2.4.x kernel spinlock_api_smp.h: fix preprocessor comments treewide: Fix typo in printk doc: device tree: clarify stuff in usage-model.txt. open firmware: "/aliasas" -> "/aliases" md: bcache: Fixed a typo with the word 'arithmetic' irq/generic-chip: fix a few kernel-doc entries frv: Convert use of typedef ctl_table to struct ctl_table sgi: xpc: Convert use of typedef ctl_table to struct ctl_table doc: clk: Fix incorrect wording Documentation/arm/IXP4xx fix a typo Documentation/networking/ieee802154 fix a typo Documentation/DocBook/media/v4l fix a typo Documentation/video4linux/si476x.txt fix a typo Documentation/virtual/kvm/api.txt fix a typo Documentation/early-userspace/README fix a typo Documentation/video4linux/soc-camera.txt fix a typo lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment ...
2013-07-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...
2013-07-03vfs: export lseek_execute() to modulesJie Liu
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support SEEK_DATA/SEEK_HOLE functions, we end up handling the similar matter in lseek_execute() to update the current file offset to the desired offset if it is valid, ceph also does the simliar things at ceph_llseek(). To reduce the duplications, this patch make lseek_execute() public accessible so that we can call it directly from the underlying file systems. Thanks Dave Chinner for this suggestion. [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back] v2->v1: - Add kernel-doc comments for lseek_execute() - Call lseek_execute() in ceph->llseek() Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Ben Myers <bpm@sgi.com> Cc: Ted Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) ext4: optimize starting extent in ext4_ext_rm_leaf() jbd2: invalidate handle if jbd2_journal_restart() fails ext4: translate flag bits to strings in tracepoints ext4: fix up error handling for mpage_map_and_submit_extent() jbd2: fix theoretical race in jbd2__journal_restart ext4: only zero partial blocks in ext4_zero_partial_blocks() ext4: check error return from ext4_write_inline_data_end() ext4: delete unnecessary C statements ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() jbd2: move superblock checksum calculation to jbd2_write_superblock() ext4: pass inode pointer instead of file pointer to punch hole ext4: improve free space calculation for inline_data ext4: reduce object size when !CONFIG_PRINTK ext4: improve extent cache shrink mechanism to avoid to burn CPU time ext4: implement error handling of ext4_mb_new_preallocation() ext4: fix corruption when online resizing a fs with 1K block size ext4: delete unused variables ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents jbd2: remove debug dependency on debug_fs and update Kconfig help text jbd2: use a single printk for jbd_debug() ...
2013-07-02Btrfs: wait ordered range before doing direct ioJosef Bacik
My recent truncate patch uncovered this bug, but I can reproduce it without the truncate patch. If you mount with -o compress-force, do a direct write to some area, do a buffered write to some other area, and then do a direct read you will get the wrong data for where you did the buffered write. This is because the generic direct io helpers only call filemap_write_and_wait once, and for compression we need it twice. So to be safe add the btrfs_wait_ordered_range to the start of the direct io function to make sure any compressed writes have truly been written. This patch makes xfstests 130 pass when you mount with -o compress-force=lzo. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: only do the tree_mod_log_free_eb if this is our last refJosef Bacik
There is another bug in the tree mod log stuff in that we're calling tree_mod_log_free_eb every single time a block is cow'ed. The problem with this is that if this block is shared by multiple snapshots we will call this multiple times per block, so if we go to rewind the mod log for this block we'll BUG_ON() in __tree_mod_log_rewind because we try to rewind a free twice. We only want to call tree_mod_log_free_eb if we are actually freeing the block. With this patch I no longer hit the panic in __tree_mod_log_rewind. Thanks, Cc: stable@vger.kernel.org Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: hold the tree mod lock in __tree_mod_log_rewindJosef Bacik
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk forward in the tree mod entries, otherwise we'll end up with random entries and trip the BUG_ON() at the front of __tree_mod_log_rewind. This fixes the panics people were seeing when running find /whatever -type f -exec btrfs fi defrag {} \; Thansk, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: make backref walking code handle skinny metadataJosef Bacik
I missed fixing the backref stuff when I introduced the skinny metadata. If you try and do things like snapshot aware defrag with skinny metadata you are going to see tons of warnings related to the backref count being less than 0. This is because the delayed refs will be found for stuff just fine, but it won't find the skinny metadata extent refs. With this patch I'm not seeing warnings anymore. Thanks, Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: fix crash regarding to ulist_add_mergeLiu Bo
Several users reported this crash of NULL pointer or general protection, the story is that we add a rbtree for speedup ulist iteration, and we use krealloc() to address ulist growth, and krealloc() use memcpy to copy old data to new memory area, so it's OK for an array as it doesn't use pointers while it's not OK for a rbtree as it uses pointers. So krealloc() will mess up our rbtree and it ends up with crash. Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: fix several potential problems in copy_nocow_pages_for_inodeMiao Xie
- It makes no sense that we deal with a inode in the dead tree. - fix the race between dio and page copy by waiting the dio completion - avoid the page copy vs truncate/punch hole - check if the page is in the page cache or not Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: cleanup the code of copy_nocow_pages_for_inode()Miao Xie
- It make no sense that we continue to do something after the error happened, just go back with this patch. - remove some check of copy_nocow_pages_for_inode(), such as page check after write, inode check in the end of the function, because we are sure they exist. - remove the unnecessary goto in the return value check of the write Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: fix oops when recovering the file data by scrub functionMiao Xie
We get oops while running btrfs replace start test, ------------[ cut here ]------------ kernel BUG at mm/filemap.c:608! [SNIP] Call Trace: [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs] [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs] [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs] [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0 [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs] [<ffffffff8106f2f0>] kthread+0xc0/0xd0 [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80 [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0 [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80 [SNIP] RIP [<ffffffff8111f4c5>] unlock_page+0x35/0x40 RSP <ffff88010316bb98> ---[ end trace 421e79ad0dd72c7d ]--- it is because we forgot to lock the page again after we read data to the page. Fix it. Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: make the chunk allocator completely tree locklessJosef Bacik
When adjusting the enospc rules for relocation I ran into a deadlock because we were relocating the only system chunk and that forced us to try and allocate a new system chunk while holding locks in the chunk tree, which caused us to deadlock. To fix this I've moved all of the dev extent addition and chunk addition out to the delayed chunk completion stuff. We still keep the in-memory stuff which makes sure everything is consistent. One change I had to make was to search the commit root of the device tree to find a free dev extent, and hold onto any chunk em's that we allocated in that transaction so we do not allocate the same dev extent twice. This has the side effect of fixing a bug with balance that has been there ever since balance existed. Basically you can free a block group and it's dev extent and then immediately allocate that dev extent for a new block group and write stuff to that dev extent, all within the same transaction. So if you happen to crash during a balance you could come back to a completely broken file system. This patch should keep these sort of things from happening in the future since we won't be able to allocate free'd dev extents until after the transaction commits. This has passed all of the xfstests and my super annoying stress test followed by a balance. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: cleanup orphaned root orphan itemJosef Bacik
I hit a weird problem were my root item had been deleted but the orphan item had not. This isn't necessarily a problem, but it keeps the file system from being mounted. To fix this we just need to axe the orphan item if we can't find the fs root when we're putting them altogether. With this patch I was able to successfully mount my file system. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: fix wrong mirror number tuningMiao Xie
Now reading the data from the target device of the replace operation is allowed, so the mirror number that is greater than the stripes number of a chunk is valid, we will tune it when we find there is no target device later. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: cleanup redundant code in btrfs_submit_direct()Miao Xie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: remove btrfs_sector_sum structureMiao Xie
Using the structure btrfs_sector_sum to keep the checksum value is unnecessary, because the extents that btrfs_sector_sum points to are continuous, we can find out the expected checksums by btrfs_ordered_sum's bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After removing bytenr, there is only one member in the structure, so it makes no sense to keep the structure, just remove it, and use a u32 array to store the checksum value. By this change, we don't use the while loop to get the checksums one by one. Now, we can get several checksum value at one time, it improved the performance by ~74% on my SSD (31MB/s -> 54MB/s). test command: # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: check if we can nocow if we don't have data spaceJosef Bacik
We always just try and reserve data space when we write, but if we are out of space but have prealloc'ed extents we should still successfully write. This patch will try and see if we can write to prealloc'ed space and if we can go ahead and allow the write to continue. With this patch we now pass xfstests generic/274. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delallocJosef Bacik
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which is completely fraking useless for us as we need to make sure pages are actually written before we go and check if there are ordered extents. So replace this with an open coding of try_to_writeback_inodes_sb_nr minus the writeback underway check so that we are sure to actually have flushed some dirty pages out and will have ordered extents to use. With this patch xfstests generic/273 now passes. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: use a percpu to keep track of possibly pinned bytesJosef Bacik
There are all of these checks in the ENOSPC code to see if committing the transaction would free up enough space to make the allocation. This is because early on we just committed the transaction and hoped and prayed, which resulted in cases where it took _forever_ to get an ENOSPC when we really were out of space. So we check space_info->bytes_pinned, except this isn't completely true because it doesn't account for space we may free but are stuck in delayed refs. So tests like xfstests 226 would fail because we wouldn't commit the transaction to free up the data space. So instead add a percpu counter that will be a little fuzzier, it will add bytes as soon as we try to free up the space, and remove any space it doesn't actually free up when we get around to doing the actual free. We then 0 out this counter every transaction period so we have a better idea of how much space we will actually free up by committing this transaction. With this patch we now pass xfstests 226. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02Btrfs: check for actual acls rather than just xattrs when caching no aclJosef Bacik
We have an optimization that will go ahead and cache no acls on an inode if there are no xattrs on the inode. This saves us a lookup later to check the acls for writes or any other access. The problem is I use selinux so I always have an xattr on inodes, so make this test a little smarter and check for the actual acl hash on the key and if it isn't there then we still get to cache no acl which makes everybody who uses selinux a little happier. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncateJosef Bacik
This has plagued us forever and I'm so over working around it. When we truncate down to a non-page aligned offset we will call btrfs_truncate_page to zero out the end of the page and write it back to disk, this will keep us from exposing stale data if we truncate back up from that point. The problem with this is it requires data space to do this, and people don't really expect to get ENOSPC from truncate() for these sort of things. This also tends to bite the orphan cleanup stuff too which keeps people from mounting. To get around this we can just move this into btrfs_cont_expand() to make sure if we are truncating up from a non-page size aligned i_size we will zero out the rest of this page so that we don't expose stale data. This will give ENOSPC if you try to truncate() up or if you try to write past the end of isize, which is much more reasonable. This fixes xfstests generic/083 failing to mount because of the orphan cleanup failing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: optimize reada_for_balanceJosef Bacik
This patch does two things. First we no longer explicitly read in the blocks we're trying to readahead. For things like balance_level we may never actually use the blocks so this just adds uneeded latency, and balance_level and split_node will both read in the blocks they care about explicitly so if the blocks need to be waited on it will be done there. Secondly we no longer drop the path if we do readahead, we just set the path blocking before we call reada_for_balance() and then we're good to go. Hopefully this will cut down on the number of re-searches. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: optimize read_block_for_searchJosef Bacik
This patch does two things, first it only does one call to btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then again with gen specified. The other thing is to call btrfs_read_buffer() on the buffer we've found instead of dropping it and then calling read_tree_block(). This will keep us from doing yet another radix tree lookup for a buffer we've already found. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: unlock extent range on enospc in compressed submitJosef Bacik
A user reported a deadlock where the async submit thread was blocked on the lock_extent() lock, and then everybody behind him was locked on the page lock for the page he was holding. Looking at the code I noticed we do not unlock the extent range when we get ENOSPC and goto retry. This is bad because we immediately try to lock that range again to do the cow, which will cause a deadlock. Fix this by unlocking the range. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: fix the comment typo for btrfs_attach_transaction_barrierWang Sheng-Hui
The comment is for btrfs_attach_transaction_barrier, not for btrfs_attach_transaction. Fix the typo. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: fix not being able to find skinny extents during relocateJosef Bacik
We unconditionally search for the EXTENT_ITEM_KEY for metadata during balance, and then check the key that we found to see if it is actually a METADATA_ITEM_KEY, but this doesn't work right because METADATA is a higher key value, so if what we are looking for happens to be the first item in the leaf the search will dump us out at the previous leaf, and we won't find our item. So instead do what we do everywhere else, search for the skinny extent first and if we don't find it go back and re-search for the extent item. This patch fixes the panic I was hitting when balancing a large file system with skinny extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: cleanup backref search commit root flag stuffJosef Bacik
Looking into this backref problem I noticed we're using a macro to what turns out to essentially be a NULL check to see if we need to search the commit root. I'm killing this, let's just do what everybody else does and checks if trans == NULL. I've also made it so we pass in the path to __resolve_indirect_refs which will have the search_commit_root flag set properly already and that way we can avoid allocating another path when we have a perfectly good one to use. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: free csums when we're done scrubbing an extentJosef Bacik
A user reported scrub taking up an unreasonable amount of ram as it ran. This is because we lookup the csums for the extent we're scrubbing but don't free it up until after we're done with the scrub, which means we can take up a whole lot of ram. This patch fixes this by dropping the csums once we're done with the extent we've scrubbed. The user reported this to fix their problem. Thanks, Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: fix transaction throttling for delayed refsJosef Bacik
Dave has this fs_mark script that can make btrfs abort with sufficient amount of ram. This is because with more ram we can keep more dirty metadata in cache which in a round about way makes for many more pending delayed refs. What happens is we end up not throttling the transaction enough so when we go to commit the transaction when we've completely filled the file system we'll abort() because we use all of the space in the global reserve and we still have delayed refs to run. To fix this we need to make the delayed ref flushing and the transaction throttling dependant upon the number of delayed refs that we have instead of how much reserved space is left in the global reserve. With this patch we not only stop aborting transactions but we also get a smoother run speed with fs_mark and it makes us about 10% faster. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: stop waiting on current trans if we abortedJosef Bacik
I hit a hang when run_delayed_refs returned an error in the beginning of btrfs_commit_transaction. If we decide we need to commit the transaction in btrfs_end_transaction we'll set BLOCKED and start to commit, but if we get an error this early on we'll just exit without committing. This is fine, except that anybody else who tried to start a transaction will sit in wait_current_trans() since we're set to BLOCKED and we never set it to something else and woke people up. To fix this we want to check for trans->aborted everywhere we wait for the transaction state to change, and make btrfs_abort_transaction() wake up any waiters there may be. All the callers will notice that the transaction has aborted and exit out properly. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: wake up delayed ref flushing waiters on abortJosef Bacik
I hit a deadlock because we aborted when flushing delayed refs but didn't wake any of the other flushers up and so everybody was just sleeping forever. This should fix the problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01btrfs: fix the code comments for LZO compression workspaceJie Liu
Fix the code comments for lzo compression workspace. The buf item is used to store the decompressed data and cbuf is used to store the compressed data. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01Btrfs: fix broken nocow after balanceMiao Xie
Balance will create reloc_root for each fs root, and it's going to record last_snapshot to filter shared blocks. The side effect of setting last_snapshot is to break nocow attributes of files. Since the extents are not shared by the relocation tree after the balance, we can recover the old last_snapshot safely if no one snapshoted the source tree. We fix the above problem by this way. Reported-by: Kyle Gates <kylegates@hotmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-29btrfs: more open-coded file_inode()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29[readdir] convert btrfsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-14Btrfs: exclude logged extents before replying when we are mixedJosef Bacik
With non-mixed block groups we replay the logs before we're allowed to do any writes, so we get away with not pinning/removing the data extents until right when we replay them. However with mixed block groups we allocate out of the same pool, so we could easily allocate a metadata block that was logged in our tree log. To deal with this we just need to notice that we have mixed block groups and do the normal excluding/removal dance during the pin stage of the log replay and that way we don't allocate metadata blocks from areas we have logged data extents. With this patch we now pass xfstests generic/311 with mixed block groups turned on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: put our inode if orphan cleanup failsJosef Bacik
When we cross into a different subvol when doing a lookup we will run the orhpan cleanup. If this fails however we do not drop the ref to the inode we were looking up before we return an error, which leads to busy inodes on umount. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: add some missing iput()'s in btrfs_orphan_cleanupJosef Bacik
There are some error cases that we don't do an iput() on our inode, fix this. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: do not pin while under spin lockJosef Bacik
When testing a corrupted fs I noticed I was getting sleep while atomic errors when the transaction aborted. This is because btrfs_pin_extent may need to allocate memory and we are calling this under the spin lock. Fix this by moving it out and doing the pin after dropping the spin lock but before dropping the mutex, the same way it works when delayed refs run normally. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: Cocci spatch "memdup.spatch"Thomas Meyer
Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: Cocci spatch "ptr_ret.spatch"Thomas Meyer
Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: fix qgroup rescan resume on mountJan Schmidt
When called during mount, we cannot start the rescan worker thread until open_ctree is done. This commit restuctures the qgroup rescan internals to enable a clean deferral of the rescan resume operation. First of all, the struct qgroup_rescan is removed, saving us a malloc and some initialization synchronizations problems. Its only element (the worker struct) now lives within fs_info just as the rest of the rescan code. Then setting up a rescan worker is split into several reusable stages. Currently we have three different rescan startup scenarios: (A) rescan ioctl (B) rescan resume by mount (C) rescan by quota enable Each case needs its own combination of the four following steps: (1) set the progress [A, C: zero; B: state of umount] (2) commit the transaction [A] (3) set the counters [A, C: zero; B: state of umount] (4) start worker [A, B, C] qgroup_rescan_init does step (1). There's no extra function added to commit a transaction, we've got that already. qgroup_rescan_zero_tracking does step (3). Step (4) is nothing more than a call to the generic btrfs_queue_worker. We also get rid of a double check for the rescan progress during btrfs_qgroup_account_ref, which is no longer required due to having step 2 from the list above. As a side effect, this commit prepares to move the rescan start code from btrfs_run_qgroups (which is run during commit) to a less time critical section. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: avoid double free of fs_info->qgroup_ulistJan Schmidt
When btrfs_read_qgroup_config or btrfs_quota_enable return non-zero, we've already freed the fs_info->qgroup_ulist. The final btrfs_free_qgroup_config called from quota_disable makes another ulist_free(fs_info->qgroup_ulist) call. We set fs_info->qgroup_ulist to NULL on the mentioned error paths, turning the ulist_free in btrfs_free_qgroup_config into a noop. Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14Btrfs: fix memory patcher through fs_info->qgroup_ulistJan Schmidt
Commit 5b7c665e introduced fs_info->qgroup_ulist, that is allocated during btrfs_read_qgroup_config and meant to be used later by the qgroup accounting code. However, it is always freed before btrfs_read_qgroup_config returns, becuase the commit mentioned above adds a check for (ret), where a check for (ret < 0) would have been the right choice. This commit fixes the check. Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>