summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/extent-tree.c
AgeCommit message (Collapse)Author
2012-08-28Btrfs: allow delayed refs to be mergedJosef Bacik
Daniel Blueman reported a bug with fio+balance on a ramdisk setup. Basically what happens is the balance relocates a tree block which will drop the implicit refs for all of its children and adds a full backref. Once the block is relocated we have to add the implicit refs back, so when we cow the block again we add the implicit refs for its children back. The problem comes when the original drop ref doesn't get run before we add the implicit refs back. The delayed ref stuff will specifically prefer ADD operations over DROP to keep us from freeing up an extent that will have references to it, so we try to add the implicit ref before it is actually removed and we panic. This worked fine before because the add would have just canceled the drop out and we would have been fine. But the backref walking work needs to be able to freeze the delayed ref stuff in time so we have this ever increasing sequence number that gets attached to all new delayed ref updates which makes us not merge refs and we run into this issue. So to fix this we need to merge delayed refs. So everytime we run a clustered ref we need to try and merge all of its delayed refs. The backref walking stuff locks the delayed ref head before processing, so if we have it locked we are safe to merge any refs inside of the sequence number. If there is no sequence number we can merge all refs. Doing this not only fixes our bug but keeps the delayed ref code from adding and removing useless refs and batching together multiple refs into one search instead of one search per delayed ref, which will really help our commit times. I ran this with Daniels test and 276 and I haven't seen any problems. Thanks, Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28Btrfs: fix race in run_clustered_refsArne Jansen
With commit commit d1270cd91f308c9d22b2804720c36ccd32dbc35e Author: Arne Jansen <sensille@gmx.net> Date: Tue Sep 13 15:16:43 2011 +0200 Btrfs: put back delayed refs that are too new I added a window where the delayed_ref's head->ref_mod code can diverge from the sum of the remaining refs, because we release the head->mutex in the middle. This leads to btrfs_lookup_extent_info returning wrong numbers. This patch fixes this by adjusting the head's ref_mod with each delayed ref we run. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28Btrfs: increase the size of the free space cacheJosef Bacik
Arne was complaining about the space cache having mismatching generation numbers when debugging a deadlock. This is because we can run out of space in our preallocated range for our space cache if you have a pretty fragmented amount of space in your pinned space. So just increase the amount of space we preallocate for space cache so we can be sure to have enough space. This will only really affect data ranges since their the only chunks that end up larger than 256MB. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28Btrfs: fix deadlock in wait_for_more_refsArne Jansen
Commit a168650c introduced a waiting mechanism to prevent busy waiting in btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations, where a tree_mod_seq is held while waiting for the io to complete, while the end_io calls btrfs_run_delayed_refs. This whole mechanism is unnecessary. If not enough runnable refs are available to satisfy count, just return as count is more like a guideline than a strict requirement. In case we have to run all refs, commit transaction makes sure that no other threads are working in the transaction anymore, so we just assert here that no refs are blocked. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28Btrfs: unlock on error in btrfs_delalloc_reserve_metadata()Dan Carpenter
We should release this mutex before returning the error code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-25Btrfs: add a barrier before a waitqueue_active checkChris Mason
We were missing wakeups on the delayed ref waitqueue due to races on waitqueue_active. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25Merge branch 'qgroup' of git://git.jan-o-sch.net/btrfs-unstable into for-linusChris Mason
Conflicts: fs/btrfs/ioctl.c fs/btrfs/ioctl.h fs/btrfs/transaction.c fs/btrfs/transaction.h Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23Btrfs: make btrfs's allocation smoothly with preallocationLiu Bo
For backref walking, we've introduce delayed ref's sequence. However, it changes our preallocation behavior. The story is that when we preallocate an extent and then mark it written piece by piece, the ideal case should be that we don't need to COW the extent, which is why we use 'preallocate'. But we may not make use of preallocation, since when we check for cross refs on the extent, we may have two ref entries which have the same content except the sequence value, and we recognize them as cross refs and do COW to allocate another extent. So we end up with several pieces of space instead of an whole extent. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23Btrfs: kill free_space pointer from inode structureLi Zefan
Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type, which means every inode has the same BTRFS_I(inode)->free_space pointer. This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits). Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23Btrfs: add ro notification to dump_space_infoLiu Bo
Block group has ro attributes, make dump_space_info show it. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23Btrfs: fix a bug of writting free space cache during balanceLiu Bo
Here is the whole story: 1) A free space cache consists of two parts: o free space cache inode, which is special becase it's stored in root tree. o free space info, which is stored as the above inode's file data. But we only build up another new inode and does not flush its free space info onto disk when we _clear and setup_ free space cache, and this ends up with that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN. And holding DC_SETUP means that we will not truncate this free space cache inode, which means the disk offset of its file extent will remain _unchanged_ at least until next transaction finishes committing itself. 2) We can set a block group readonly when we relocate the block group. However, if the readonly block group covers the disk offset where our free space cache inode is going to write, it will force the free space cache inode into cow_file_range() and it'll end up hitting a BUG_ON. 3) Due to the above analysis, we fix this bug by adding the missing dirty flag. 4) However, it's not over, there is still another case, nospace_cache. With nospace_cache, we do not want to set dirty flag, instead we just truncate free space cache inode and bail out with setting cache state DC_WRITTEN. We can benifit from it since it saves us another 'pre-allocation' part which usually costs a lot. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23Btrfs: do not abort transaction in prealloc caseLiu Bo
During disk balance, we prealloc new file extent for file data relocation, but we may fail in 'no available space' case, and it leads to flipping btrfs into readonly. It is not necessary to bail out and abort transaction since we do have several ways to rescue ourselves from ENOSPC case. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23Btrfs: kill root from btrfs_is_free_space_inodeLiu Bo
Since root can be fetched via BTRFS_I macro directly, we can save an args for btrfs_is_free_space_inode(). Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23Btrfs: rework shrink_delallocJosef Bacik
So shrink_delalloc has grown all sorts of cruft over the years thanks to many reworkings of how we track enospc. What happens now as we fill up the disk is we will loop for freaking ever hoping to reclaim a arbitrary amount of space of metadata, this was from when everybody flushed at the same time. Now we only have people flushing one at a time. So instead of trying to reclaim a huge amount of space, just try to flush a decent chunk of space, and stop looping as soon as we have enough free space to satisfy our reservation. This makes xfstests 224 go much faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23Btrfs: change how we indicate we're adding csumsJosef Bacik
There is weird logic I had to put in place to make sure that when we were adding csums that we'd used the delalloc block rsv instead of the global block rsv. Part of this meant that we had to free up our transaction reservation before we ran the delayed refs since csum deletion happens during the delayed ref work. The problem with this is that when we release a reservation we will add it to the global reserve if it is not full in order to keep us going along longer before we have to force a transaction commit. By releasing our reservation before we run delayed refs we don't get the opportunity to drain down the global reserve for the work we did, so we won't refill it as often. This isn't a problem per-se, it just results in us possibly committing transactions more and more often, and in rare cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran out of space in our block rsv. This also helps us by holding onto space while the delayed refs run so we don't end up with as many people trying to do things at the same time, which again will help us not force commits or hit the use_block_rsv warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23Btrfs: flush delayed inodes if we're short on spaceJosef Bacik
Those crazy gentoo guys have been complaining about ENOSPC errors on their portage volumes. This is because doing things like untar tends to create lots of new files which will soak up all the reservation space in the delayed inodes. Usually this gets papered over by the fact that we will try and commit the transaction, however if this happens in the wrong spot or we choose not to commit the transaction you will be screwed. So add the ability to expclitly flush delayed inodes to free up space. Please test this out guys to make sure it works since as usual I cannot reproduce. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-12Btrfs: hooks to reserve qgroup spaceArne Jansen
Like block reserves, reserve a small piece of space on each transaction start and for delalloc. These are the hooks that can actually return EDQUOT to the user. The amount of space reserved is tracked in the transaction handle. Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-12Btrfs: call the qgroup accounting functionsJan Schmidt
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-12Btrfs: qgroup implementation and prototypesArne Jansen
Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-10Btrfs: Test code to change the order of delayed-ref processingArne Jansen
Normally delayed refs get processed in ascending bytenr order. This correlates in most cases to the order added. To expose dependencies on this order, we start to process the tree in the middle instead of the beginning. This code is only effective when SCRAMBLE_DELAYED_REFS is defined. Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-10Btrfs: join tree mod log code with the code holding back delayed refsJan Schmidt
We've got two mechanisms both required for reliable backref resolving (tree mod log and holding back delayed refs). You cannot make use of one without the other. So instead of requiring the user of this mechanism to setup both correctly, we join them into a single interface. Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list as we did before, which was of no value. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27Btrfs: avoid waiting for delayed refs when we must notJan Schmidt
We track two conditions to decide if we should sleep while waiting for more delayed refs, the number of delayed refs (num_refs) and the first entry in the list of blockers (first_seq). When we suspect staleness, we save num_refs and do one more cycle. If nothing changes, we then save first_seq for later comparison and do wait_event. We ought to save first_seq the very same moment we're saving num_refs. Otherwise we cannot be sure that nothing has changed and we might start waiting when we shouldn't, which could lead to starvation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-31Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into ↵Chris Mason
for-linus Conflicts: fs/btrfs/ulist.h Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-30Btrfs: convert the inode bit field to use the actual bit operationsJosef Bacik
Miao pointed this out while I was working on an orphan problem that messing with a bitfield where different ranges are protected by different locks doesn't work out right. Turns out we've been doing this forever where we have different parts of the bit field protected by either no lock at all or different locks which could cause all sorts of weird problems including the issue I was hitting. So instead make a runtime_flags thing that we use the normal bit operations on that are all atomic so we can keep having our no/different locking for the different flags and then make force_compress it's own thing so it can be treated normally. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-26Btrfs: don't set for_cow parameter for tree block functionsJan Schmidt
Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed parameter for_cow = 1. In fact, these two functions should never mark their tree modification operations as for_cow, because they can change the number of blocks referenced by a tree. Hence, we remove the extra for_cow parameter from these functions and make them pass a zero down. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-11Btrfs: cleanup: use consistent lock namingDan Carpenter
It confuses Smatch that we use two names for the same lock. Plus the shorter name is nicer. This doesn't change how the code works, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-05-06Btrfs: avoid sleeping in verify_parent_transid while atomicChris Mason
verify_parent_transid needs to lock the extent range to make sure no IO is underway, and so it can safely clear the uptodate bits if our checks fail. But, a few callers are using it with spinlocks held. Most of the time, the generation numbers are going to match, and we don't want to switch to a blocking lock just for the error case. This adds an atomic flag to verify_parent_transid, and changes it to return EAGAIN if it needs to block to properly verifiy things. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-27Btrfs: fix block_rsv and space_info lock orderingStefan Behrens
may_commit_transaction() calls spin_lock(&space_info->lock); spin_lock(&delayed_rsv->lock); and update_global_block_rsv() calls spin_lock(&block_rsv->lock); spin_lock(&sinfo->lock); Lockdep complains about this at run time. Everywhere except in update_global_block_rsv(), the space_info lock is the outer lock, therefore the locking order in update_global_block_rsv() is changed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-18btrfs: don't return EINTRArne Jansen
It is basically a good thing if we are interruptible when waiting for free space, but the generality in which it is implemented currently leads to system calls being interruptible that are not documented this way. For example git can't handle interrupted unlink(), leading to corrupt repos under space pressure. Instead we raise the bar to only be interruptible by SIGKILL. Thanks to David Sterba for suggesting this. Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-04-18Btrfs: double unlock bug in error handlingDan Carpenter
The caller expects this function to return with the lock held and releases it immediately on error. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-04-12Btrfs: use commit root when loading free space cacheJosef Bacik
A user reported that booting his box up with btrfs root on 3.4 was way slower than on 3.3 because I removed the ideal caching code. It turns out that we don't load the free space cache if we're in a commit for deadlock reasons, but since we're reading the cache and it hasn't changed yet we are safe reading the inode and free space item from the commit root, so do that and remove all of the deadlock checks so we don't unnecessarily skip loading the free space cache. The user reported this fixed the slowness. Thanks, Tested-by: Calvin Walton <calvin.walton@kepstin.ca> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-12Btrfs: remove lock assert from get_restripe_target()Ilya Dryomov
This fixes a regression introduced by fc67c450. spin_is_locked() always returns 0 on UP kernels, which caused assert in get_restripe_target() to be fired on every call from btrfs_reduce_alloc_profile() on UP systems. Remove it completely for now, it's not clear if it's going to be needed in future. Reported-by: Bobby Powers <bobbypowers@gmail.com> Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Tested-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-12Revert "Btrfs: increase the global block reserve estimates"Chris Mason
This reverts commit 5500cdbe14d7435e04f66ff3cfb8ecd8b8e44ebf. We've had a number of complaints of early enospc that bisect down to this patch. We'll hae to fix the reservations differently. CC: stable@kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: fix deadlock during allocating chunksLiu Bo
This deadlock comes from xfstests 251. We'll hold the chunk_mutex throughout the whole of a chunk allocation. But if we find that we've used up system chunk space, we need to allocate a new system chunk, but this will lead to a recursion of chunk allocation and end up with a deadlock on chunk_mutex. So instead we need to allocate the system chunk first if we find we're in ENOSPC. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: show useful info in space reservation tracepointLiu Bo
o For space info, the type of space info is useful for debug. o For transaction handle, its transid is useful. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28Merge branch 'for-chris' of git://github.com/idryomov/btrfs-unstable into ↵Chris Mason
for-linus
2012-03-28Merge branch 'error-handling' into for-linusChris Mason
Conflicts: fs/btrfs/ctree.c fs/btrfs/disk-io.c fs/btrfs/extent-tree.c fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/inode.c fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-27Btrfs: improve the logic in btrfs_can_relocate()Ilya Dryomov
Currently if we don't have enough space allocated we go ahead and loop though devices in the hopes of finding enough space for a chunk of the *same* type as the one we are trying to relocate. The problem with that is that if we are trying to restripe the chunk its target type can be more relaxed than the current one (eg require less devices or less space). So, when restriping, run checks against the target profile instead of the current one. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: add __get_block_group_index() helperIlya Dryomov
Add __get_block_group_index() helper to be able to derive block group index from an arbitary set of flags. Implement get_block_group_index() in terms of it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: add get_restripe_target() helperIlya Dryomov
Add get_restripe_target() helper and switch everybody to use it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: move alloc_profile_is_valid() to volumes.cIlya Dryomov
Header file is not a good place to define functions. This also moves a call to alloc_profile_is_valid() down the stack and removes a redundant check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it into account. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: make profile_is_valid() check more strictIlya Dryomov
"0" is a valid value for an on-disk chunk profile, but it is not a valid extended profile. (We have a separate bit for single chunks in extended case) Also rename it to alloc_profile_is_valid() for clarity. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: add wrappers for working with alloc profilesIlya Dryomov
Add functions to abstract the conversion between chunk and extended allocation profile formats and switch everybody to use them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: stop silently switching single chunks to raid0 on balanceIlya Dryomov
This has been causing a lot of confusion for quite a while now and a lot of users were surprised by this (some of them were even stuck in a ENOSPC situation which they couldn't easily get out of). The addition of restriper gives users a clear choice between raid0 and drive concat setup so there's absolutely no excuse for us to keep doing this. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-26Btrfs: introduce free_extent_buffer_staleJosef Bacik
Because btrfs cow's we can end up with extent buffers that are no longer necessary just sitting around in memory. So instead of evicting these pages, we could end up evicting things we actually care about. Thus we have free_extent_buffer_stale for use when we are freeing tree blocks. This will make it so that the ref for the eb being in the radix tree is dropped as soon as possible and then is freed when the refcount hits 0 instead of waiting to be released by releasepage. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: remove search_start and search_end from find_free_extent and callersJosef Bacik
We have been passing nothing but (u64)-1 to find_free_extent for search_end in all of the callers, so it's completely useless, and we've always been passing 0 in as search_start, so just remove them as function arguments and move search_start into find_free_extent. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: remove the ideal caching codeJosef Bacik
This is a relic from before we had the disk space cache and it was to make bootup times when you had btrfs as root not be so damned slow. Now that we have the disk space cache this isn't a problem anymore and really having this code casues uneeded fragmentation and complexity, so just remove it. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-22btrfs: replace many BUG_ONs with proper error handlingJeff Mahoney
btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: btrfs_drop_snapshot should return intJeff Mahoney
Commit cb1b69f4 (Btrfs: forced readonly when btrfs_drop_snapshot() fails) made btrfs_drop_snapshot return void because there were no callers checking the return value. That is the wrong order to handle error propogation since the caller will have no idea that an error has occured and continue on as if nothing went wrong. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: return void in functions without error conditionsJeff Mahoney
Signed-off-by: Jeff Mahoney <jeffm@suse.com>