summaryrefslogtreecommitdiffstats
path: root/net/ceph
AgeCommit message (Collapse)Author
2015-08-03crush: fix a bug in tree bucket decodeIlya Dryomov
commit 82cd003a77173c91b9acad8033fb7931dac8d751 upstream. struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe() should be used. -Wconversion catches this, but I guess it went unnoticed in all the noise it spews. The actual problem (at least for common crushmaps) isn't the u32 -> u8 truncation though - it's the advancement by 4 bytes instead of 1 in the crushmap buffer. Fixes: http://tracker.ceph.com/issues/2759 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-05libceph: request a new osdmap if lingering request maps to no osdIlya Dryomov
commit b0494532214bdfbf241e94fabab5dd46f7b82631 upstream. This commit does two things. First, if there are any homeless lingering requests, we now request a new osdmap even if the osdmap that is being processed brought no changes, i.e. if a given lingering request turned homeless in one of the previous epochs and remained homeless in the current epoch. Not doing so leaves us with a stale osdmap and as a result we may miss our window for reestablishing the watch and lose notifies. MON=1 OSD=1: # cat linger-needmap.sh #!/bin/bash rbd create --size 1 test DEV=$(rbd map test) ceph osd out 0 rbd map dne/dne # obtain a new osdmap as a side effect (!) sleep 1 ceph osd in 0 rbd resize --size 2 test # rbd info test | grep size -> 2M # blockdev --getsize $DEV -> 1M N.B.: Not obtaining a new osdmap in between "osd out" and "osd in" above is enough to make it miss that resize notify, but that is a bug^Wlimitation of ceph watch/notify v1. Second, homeless lingering requests are now kicked just like those lingering requests whose mapping has changed. This is mainly to recognize that a homeless lingering request makes no sense and to preserve the invariant that a registered lingering request is not sitting on any of r_req_lru_item lists. This spares us a WARN_ON, which commit ba9d114ec557 ("libceph: clear r_req_lru_item in __unregister_linger_request()") tried to fix the _wrong_ way. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06libceph: fix double __remove_osd() problemIlya Dryomov
commit 7eb71e0351fbb1b242ae70abb7bb17107fe2f792 upstream. It turns out it's possible to get __remove_osd() called twice on the same OSD. That doesn't sit well with rb_erase() - depending on the shape of the tree we can get a NULL dereference, a soft lockup or a random crash at some point in the future as we end up touching freed memory. One scenario that I was able to reproduce is as follows: <osd3 is idle, on the osd lru list> <con reset - osd3> con_fault_finish() osd_reset() <osdmap - osd3 down> ceph_osdc_handle_map() <takes map_sem> kick_requests() <takes request_mutex> reset_changed_osds() __reset_osd() __remove_osd() <releases request_mutex> <releases map_sem> <takes map_sem> <takes request_mutex> __kick_osd_requests() __reset_osd() __remove_osd() <-- !!! A case can be made that osd refcounting is imperfect and reworking it would be a proper resolution, but for now Sage and I decided to fix this by adding a safe guard around __remove_osd(). Fixes: http://tracker.ceph.com/issues/8087 Cc: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06libceph: change from BUG to WARN for __remove_osd() assertsIlya Dryomov
commit cc9f1f518cec079289d11d732efa490306b1ddad upstream. No reason to use BUG_ON for osd request list assertions. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06libceph: assert both regular and lingering lists in __remove_osd()Ilya Dryomov
commit 7c6e6fc53e7335570ed82f77656cedce1502744e upstream. It is important that both regular and lingering requests lists are empty when the OSD is removed. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21libceph: do not crash on large auth ticketsIlya Dryomov
commit aaef31703a0cf6a733e651885bfb49edc3ac6774 upstream. Large (greater than 32k, the value of PAGE_ALLOC_COSTLY_ORDER) auth tickets will have their buffers vmalloc'ed, which leads to the following crash in crypto: [ 28.685082] BUG: unable to handle kernel paging request at ffffeb04000032c0 [ 28.686032] IP: [<ffffffff81392b42>] scatterwalk_pagedone+0x22/0x80 [ 28.686032] PGD 0 [ 28.688088] Oops: 0000 [#1] PREEMPT SMP [ 28.688088] Modules linked in: [ 28.688088] CPU: 0 PID: 878 Comm: kworker/0:2 Not tainted 3.17.0-vm+ #305 [ 28.688088] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 28.688088] Workqueue: ceph-msgr con_work [ 28.688088] task: ffff88011a7f9030 ti: ffff8800d903c000 task.ti: ffff8800d903c000 [ 28.688088] RIP: 0010:[<ffffffff81392b42>] [<ffffffff81392b42>] scatterwalk_pagedone+0x22/0x80 [ 28.688088] RSP: 0018:ffff8800d903f688 EFLAGS: 00010286 [ 28.688088] RAX: ffffeb04000032c0 RBX: ffff8800d903f718 RCX: ffffeb04000032c0 [ 28.688088] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d903f750 [ 28.688088] RBP: ffff8800d903f688 R08: 00000000000007de R09: ffff8800d903f880 [ 28.688088] R10: 18df467c72d6257b R11: 0000000000000000 R12: 0000000000000010 [ 28.688088] R13: ffff8800d903f750 R14: ffff8800d903f8a0 R15: 0000000000000000 [ 28.688088] FS: 00007f50a41c7700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000 [ 28.688088] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 28.688088] CR2: ffffeb04000032c0 CR3: 00000000da3f3000 CR4: 00000000000006b0 [ 28.688088] Stack: [ 28.688088] ffff8800d903f698 ffffffff81392ca8 ffff8800d903f6e8 ffffffff81395d32 [ 28.688088] ffff8800dac96000 ffff880000000000 ffff8800d903f980 ffff880119b7e020 [ 28.688088] ffff880119b7e010 0000000000000000 0000000000000010 0000000000000010 [ 28.688088] Call Trace: [ 28.688088] [<ffffffff81392ca8>] scatterwalk_done+0x38/0x40 [ 28.688088] [<ffffffff81392ca8>] scatterwalk_done+0x38/0x40 [ 28.688088] [<ffffffff81395d32>] blkcipher_walk_done+0x182/0x220 [ 28.688088] [<ffffffff813990bf>] crypto_cbc_encrypt+0x15f/0x180 [ 28.688088] [<ffffffff81399780>] ? crypto_aes_set_key+0x30/0x30 [ 28.688088] [<ffffffff8156c40c>] ceph_aes_encrypt2+0x29c/0x2e0 [ 28.688088] [<ffffffff8156d2a3>] ceph_encrypt2+0x93/0xb0 [ 28.688088] [<ffffffff8156d7da>] ceph_x_encrypt+0x4a/0x60 [ 28.688088] [<ffffffff8155b39d>] ? ceph_buffer_new+0x5d/0xf0 [ 28.688088] [<ffffffff8156e837>] ceph_x_build_authorizer.isra.6+0x297/0x360 [ 28.688088] [<ffffffff8112089b>] ? kmem_cache_alloc_trace+0x11b/0x1c0 [ 28.688088] [<ffffffff8156b496>] ? ceph_auth_create_authorizer+0x36/0x80 [ 28.688088] [<ffffffff8156ed83>] ceph_x_create_authorizer+0x63/0xd0 [ 28.688088] [<ffffffff8156b4b4>] ceph_auth_create_authorizer+0x54/0x80 [ 28.688088] [<ffffffff8155f7c0>] get_authorizer+0x80/0xd0 [ 28.688088] [<ffffffff81555a8b>] prepare_write_connect+0x18b/0x2b0 [ 28.688088] [<ffffffff81559289>] try_read+0x1e59/0x1f10 This is because we set up crypto scatterlists as if all buffers were kmalloc'ed. Fix it. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14libceph: ceph-msgr workqueue needs a resque workerIlya Dryomov
commit f9865f06f7f18c6661c88d0511f05c48612319cc upstream. Commit f363e45fd118 ("net/ceph: make ceph_msgr_wq non-reentrant") effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq. This is wrong - libceph is very much a memory reclaim path, so restore it. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Tested-by: Micha Krause <micha@krausam.de> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17libceph: gracefully handle large reply messages from the monSage Weil
commit 73c3d4812b4c755efeca0140f606f83772a39ce4 upstream. We preallocate a few of the message types we get back from the mon. If we get a larger message than we are expecting, fall back to trying to allocate a new one instead of blindly using the one we have. Signed-off-by: Sage Weil <sage@redhat.com> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17libceph: rename ceph_msg::front_max to front_alloc_lenIlya Dryomov
commit 3cea4c3071d4e55e9d7356efe9d0ebf92f0c2204 upstream. Rename front_max field of struct ceph_msg to front_alloc_len to make its purpose more clear. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17libceph: do not hard code max auth ticket lenIlya Dryomov
commit c27a3e4d667fdcad3db7b104f75659478e0c68d8 upstream. We hard code cephx auth ticket buffer size to 256 bytes. This isn't enough for any moderate setups and, in case tickets themselves are not encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but ceph_decode_copy() doesn't - it's just a memcpy() wrapper). Since the buffer is allocated dynamically anyway, allocated it a bit later, at the point where we know how much is going to be needed. Fixes: http://tracker.ceph.com/issues/8979 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17libceph: add process_one_ticket() helperIlya Dryomov
commit 597cda357716a3cf8d994cb11927af917c8d71fa upstream. Add a helper for processing individual cephx auth tickets. Needed for the next commit, which deals with allocating ticket buffers. (Most of the diff here is whitespace - view with git diff -b). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctlyIlya Dryomov
commit 5f740d7e1531099b888410e6bab13f68da9b1a4d upstream. Determining ->last_piece based on the value of ->page_offset + length is incorrect because length here is the length of the entire message. ->last_piece set to false even if page array data item length is <= PAGE_SIZE, which results in invalid length passed to ceph_tcp_{send,recv}page() and causes various asserts to fire. # cat pages-cursor-init.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar # rbd_resize calls librbd rbd_resize(), size is in bytes ./rbd_resize bar $(((4 << 20) + 512)) rbd resize --size 10 bar BAR_DEV=$(rbd map bar) # trigger a 512-byte copyup -- 512-byte page array data item dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5 The problem exists only in ceph_msg_data_pages_cursor_init(), ceph_msg_data_pages_advance() does the right thing. The size_t cast is unnecessary. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07libceph: fix corruption when using page_count 0 page in rbdChunwei Chen
commit 178eda29ca721842f2146378e73d43e0044c4166 upstream. It has been reported that using ZFSonLinux on rbd will result in memory corruption. The bug report can be found here: https://github.com/zfsonlinux/spl/issues/241 http://tracker.ceph.com/issues/7790 The reason is that ZFS will send pages with page_count 0 into rbd, which in turns send them to tcp_sendpage. However, tcp_sendpage cannot deal with page_count 0, as it will do get_page and put_page, and erroneously free the page. This type of issue has been noted before, and handled in iscsi, drbd, etc. So, rbd should also handle this. This fix address this issue by fall back to slower sendmsg when page_count 0 detected. Cc: Sage Weil <sage@inktank.com> Cc: Yehuda Sadeh <yehuda@inktank.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-31libceph: resend all writes after the osdmap loses the full flagJosh Durgin
commit 9a1ea2dbff11547a8e664f143c1ffefc586a577a upstream. With the current full handling, there is a race between osds and clients getting the first map marked full. If the osd wins, it will return -ENOSPC to any writes, but the client may already have writes in flight. This results in the client getting the error and propagating it up the stack. For rbd, the block layer turns this into EIO, which can cause corruption in filesystems above it. To avoid this race, osds are being changed to drop writes that came from clients with an osdmap older than the last osdmap marked full. In order for this to work, clients must resend all writes after they encounter a full -> not full transition in the osdmap. osds will wait for an updated map instead of processing a request from a client with a newer map, so resent writes will not be dropped by the osd unless there is another not full -> full transition. This approach requires both osds and clients to be fixed to avoid the race. Old clients talking to osds with this fix may hang instead of returning EIO and potentially corrupting an fs. New clients talking to old osds have the same behavior as before if they encounter this race. Fixes: http://tracker.ceph.com/issues/6938 Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-31libceph: block I/O when PAUSE or FULL osd map flags are setJosh Durgin
commit d29adb34a94715174c88ca93e8aba955850c9bde upstream. The PAUSEWR and PAUSERD flags are meant to stop the cluster from processing writes and reads, respectively. The FULL flag is set when the cluster determines that it is out of space, and will no longer process writes. PAUSEWR and PAUSERD are purely client-side settings already implemented in userspace clients. The osd does nothing special with these flags. When the FULL flag is set, however, the osd responds to all writes with -ENOSPC. For cephfs, this makes sense, but for rbd the block layer translates this into EIO. If a cluster goes from full to non-full quickly, a filesystem on top of rbd will not behave well, since some writes succeed while others get EIO. Fix this by blocking any writes when the FULL flag is set in the osd client. This is the same strategy used by userspace, so apply it by default. A follow-on patch makes this configurable. __map_request() is called to re-target osd requests in case the available osds changed. Add a paused field to a ceph_osd_request, and set it whenever an appropriate osd map flag is set. Avoid queueing paused requests in __map_request(), but force them to be resent if they become unpaused. Also subscribe to the next osd map from the monitor if any of these flags are set, so paused requests can be unblocked as soon as possible. Fixes: http://tracker.ceph.com/issues/6079 Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: add function to ensure notifies are completeJosh Durgin
commit dd935f44a40f8fb02aff2cc0df2269c92422df1c upstream. Without a way to flush the osd client's notify workqueue, a watch event that is unregistered could continue receiving callbacks indefinitely. Unregistering the event simply means no new notifies are added to the queue, but there may still be events in the queue that will call the watch callback for the event. If the queue is flushed after the event is unregistered, the caller can be sure no more watch callbacks will occur for the canceled watch. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: create_singlethread_workqueue() doesn't return ERR_PTRsDan Carpenter
commit dbcae088fa660086bde6e10d63bb3c9264832d85 upstream. create_singlethread_workqueue() returns NULL on error, and it doesn't return ERR_PTRs. I tweaked the error handling a little to be consistent with earlier in the function. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: potential NULL dereference in ceph_osdc_handle_map()Dan Carpenter
commit b72e19b9225d4297a18715b0998093d843d170fa upstream. There are two places where we read "nr_maps" if both of them are set to zero then we would hit a NULL dereference here. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: fix error handling in handle_reply()Dan Carpenter
commit 1874119664dafda3ef2ed9b51b4759a9540d4a1a upstream. We've tried to fix the error paths in this function before, but there is still a hidden goto in the ceph_decode_need() macro which goes to the wrong place. We need to release the "req" and unlock a mutex before returning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: call r_unsafe_callback when unsafe reply is receivedYan, Zheng
commit 61c5d6bf7074ee32d014dcdf7698dc8c59eb712d upstream. We can't use !req->r_sent to check if OSD request is sent for the first time, this is because __cancel_request() zeros req->r_sent when OSD map changes. Rather than adding a new variable to struct ceph_osd_request to indicate if it's sent for the first time, We can call the unsafe callback only when unsafe OSD reply is received. If OSD's first reply is safe, just skip calling the unsafe callback. The purpose of unsafe callback is adding unsafe request to a list, so that fsync(2) can wait for the safe reply. fsync(2) doesn't need to wait for a write(2) that hasn't returned yet. So it's OK to add request to the unsafe list when the first OSD reply is received. (ceph_sync_write() returns after receiving the first OSD reply) Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: fix truncate size calculationYan, Zheng
commit ccca4e37b1a912da3db68aee826557ea66145273 upstream. check the "not truncated yet" case Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: fix safe completionYan, Zheng
commit eb845ff13a44477f8a411baedbf11d678b9daf0a upstream. handle_reply() calls complete_request() only if the first OSD reply has ONDISK flag. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: add lingering request reference when registeredAlex Elder
commit 96e4dac66f69d28af2b736e723364efbbdf9fdee upstream. When an osd request is set to linger, the osd client holds onto the request so it can be re-submitted following certain osd map changes. The osd client holds a reference to the request until it is unregistered. This is used by rbd for watch requests. Currently, the reference is taken when the request is marked with the linger flag. This means that if an error occurs after that time but before the the request completes successfully, that reference is leaked. There's really no reason to take the reference until the request is registered in the the osd client's list of lingering requests, and that only happens when the lingering (watch) request completes successfully. So take that reference only when it gets registered following succesful completion, and drop it (as before) when the request gets unregistered. This avoids the reference problem on error in rbd. Rearrange ceph_osdc_unregister_linger_request() to avoid using the request pointer after it may have been freed. And hold an extra reference in kick_requests() while handling a linger request that has not yet been registered, to ensure it doesn't go away. This resolves: http://tracker.ceph.com/issues/3859 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calcSage Weil
commit 9542cf0bf9b1a3adcc2ef271edbcbdba03abf345 upstream. Fix a typo that used the wrong bitmask for the pg.seed calculation. This is normally unnoticed because in most cases pg_num == pgp_num. It is, however, a bug that is easily corrected. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <alex.elder@linary.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26libceph: unregister request in __map_request failed and nofail == falsemajianpeng
commit 73d9f7eef3d98c3920e144797cc1894c6b005a1e upstream. For nofail == false request, if __map_request failed, the caller does cleanup work, like releasing the relative pages. It doesn't make any sense to retry this request. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-13libceph: Fix NULL pointer dereference in auth client codeTyler Hicks
commit 2cb33cac622afde897aa02d3dcd9fbba8bae839e upstream. A malicious monitor can craft an auth reply message that could cause a NULL function pointer dereference in the client's kernel. To prevent this, the auth_none protocol handler needs an empty ceph_auth_client_ops->build_request() function. CVE-2013-1059 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Chanam Park <chanam.park@hkpco.kr> Reviewed-by: Seth Arnold <seth.arnold@canonical.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-17libceph: must hold mutex for reset_changed_osds()Alex Elder
An osd client has a red-black tree describing its osds, and occasionally we would get crashes due to one of these trees tree becoming corrupt somehow. The problem turned out to be that reset_changed_osds() was being called without protection of the osd client request mutex. That function would call __reset_osd() for any osd that had changed, and __reset_osd() would call __remove_osd() for any osd with no outstanding requests, and finally __remove_osd() would remove the corresponding entry from the red-black tree. Thus, the tree was getting modified without having any lock protection, and was vulnerable to problems due to concurrent updates. This appears to be the only osd tree updating path that has this problem. It can be fairly easily fixed by moving the call up a few lines, to just before the request mutex gets dropped in kick_requests(). This resolves: http://tracker.ceph.com/issues/5043 Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-13libceph: init sent and completed when startingAlex Elder
The rbd code has a need to be able to restart an osd request that has already been started and completed once before. This currently wouldn't work right because the osd client code assumes an osd request will be started exactly once Certain fields in a request are never cleared and this leads to trouble if you try to reuse it. Specifically, the r_sent, r_got_reply, and r_completed fields are never cleared. The r_sent field records the osd incarnation at the time the request was sent to that osd. If that's non-zero, the message won't get re-mapped to a target osd properly, and won't be put on the unsafe requests list the first time it's sent as it should. The r_got_reply field is used in handle_reply() to ensure the reply to a request is processed only once. And the r_completed field is used for lingering requests to avoid calling the callback function every time the osd client re-sends the request on behalf of its initiator. Each osd request passes through ceph_osdc_start_request() when responsibility for the request is handed over to the osd client for completion. We can safely zero these three fields there each time a request gets started. One last related change--clear the r_linger flag when a request is no longer registered as a linger request. This resolves: http://tracker.ceph.com/issues/5026 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02libceph: use slab cache for osd client requestsAlex Elder
Create a slab cache to manage allocation of ceph_osdc_request structures. This resolves: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02libceph: allocate ceph message data with a slab allocatorAlex Elder
Create a slab cache to manage ceph_msg_data structure allocation. This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02libceph: allocate ceph messages with a slab allocatorAlex Elder
Create a slab cache to manage ceph_msg structure allocation. This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: create source file "net/ceph/snapshot.c"Alex Elder
This creates a new source file "net/ceph/snapshot.c" to contain utility routines related to ceph snapshot contexts. The main motivation was to define ceph_create_snap_context() as a common way to create these structures, but I've moved the definitions of ceph_get_snap_context() and ceph_put_snap_context() there too. (The benefit of inlining those is very small, and I'd rather keep this collection of functions together.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: fix byte order mismatchAlex Elder
A WATCH op includes an object version. The version that's supplied is incorrectly byte-swapped osd_req_op_watch_init() where it's first assigned (it's been this way since that code was first added). The result is that the version sent to the osd is wrong, because that value gets byte-swapped again in osd_req_encode_op(). This is the source of a sparse warning related to improper byte order in the assignment. The approach of using the version to avoid a race is deprecated (see http://tracker.ceph.com/issues/3871), and the watch parameter is no longer even examined by the osd. So fix the assignment in osd_req_op_watch_init() so it no longer does the byte swap. This resolves: http://tracker.ceph.com/issues/3847 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: support pages for class request dataAlex Elder
Add the ability to provide an array of pages as outbound request data for object class method calls. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: fix two messenger bugsAlex Elder
This patch makes four small changes in the ceph messenger. While getting copyup functionality working I found two bugs in the messenger. Existing paths through the code did not trigger these problems, but they're fixed here: - In ceph_msg_data_pagelist_cursor_init(), the cursor's last_piece field was being checked against the length supplied. This was OK until this commit: ccba6d98 libceph: implement multiple data items in a message That commit changed the cursor init routines to allow lengths to be supplied that exceeded the size of the current data item. Because of this, we have to use the assigned cursor resid field rather than the provided length in determining whether the cursor points to the last piece of a data item. - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously catching attempts to add page data to a message if the message already had data assigned to it. That was OK until that same commit, at which point it was fine for messages to have multiple data items. It slipped through because that BUG_ON() call was present twice in that function. (You can never be too careful.) In addition two other minor things are changed: - In ceph_msg_data_cursor_init(), the local variable "data" was getting assigned twice. - In ceph_msg_data_advance(), it was assumed that the type-specific advance routine would set new_piece to true after it advanced past the last piece. That may have been fine, but since we check for that case we might as well set it explicitly in ceph_msg_data_advance(). This resolves: http://tracker.ceph.com/issues/4762 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: support raw data requestsAlex Elder
Allow osd request ops that aren't otherwise structured (not class, extent, or watch ops) to specify "raw" data to be used to hold incoming data for the op. Make use of this capability for the osd STAT op. Prefix the name of the private function osd_req_op_init() with "_", and expose a new function by that (earlier) name whose purpose is to initialize osd ops with (only) implied data. For now we'll just support the use of a page array for an osd op with incoming raw data. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: clean up osd data field access functionsAlex Elder
There are a bunch of functions defined to encapsulate getting the address of a data field for a particular op in an osd request. They're all defined the same way, so create a macro to take the place of all of them. Two of these are used outside the osd client code, so preserve them (but convert them to use the new macro internally). Stop exporting the ones that aren't used elsewhere. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: kill off osd data write_request parametersAlex Elder
In the incremental move toward supporting distinct data items in an osd request some of the functions had "write_request" parameters to indicate, basically, whether the data belonged to in_data or the out_data. Now that we maintain the data fields in the op structure there is no need to indicate the direction, so get rid of the "write_request" parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: change how "safe" callback is usedAlex Elder
An osd request currently has two callbacks. They inform the initiator of the request when we've received confirmation for the target osd that a request was received, and when the osd indicates all changes described by the request are durable. The only time the second callback is used is in the ceph file system for a synchronous write. There's a race that makes some handling of this case unsafe. This patch addresses this problem. The error handling for this callback is also kind of gross, and this patch changes that as well. In ceph_sync_write(), if a safe callback is requested we want to add the request on the ceph inode's unsafe items list. Because items on this list must have their tid set (by ceph_osd_start_request()), the request added *after* the call to that function returns. The problem with this is that there's a race between starting the request and adding it to the unsafe items list; the request may already be complete before ceph_sync_write() even begins to put it on the list. To address this, we change the way the "safe" callback is used. Rather than just calling it when the request is "safe", we use it to notify the initiator the bounds (start and end) of the period during which the request is *unsafe*. So the initiator gets notified just before the request gets sent to the osd (when it is "unsafe"), and again when it's known the results are durable (it's no longer unsafe). The first call will get made in __send_request(), just before the request message gets sent to the messenger for the first time. That function is only called by __send_queued(), which is always called with the osd client's request mutex held. We then have this callback function insert the request on the ceph inode's unsafe list when we're told the request is unsafe. This will avoid the race because this call will be made under protection of the osd client's request mutex. It also nicely groups the setup and cleanup of the state associated with managing unsafe requests. The name of the "safe" callback field is changed to "unsafe" to better reflect its new purpose. It has a Boolean "unsafe" parameter to indicate whether the request is becoming unsafe or is now safe. Because the "msg" parameter wasn't used, we drop that. This resolves the original problem reportedin: http://tracker.ceph.com/issues/4706 Reported-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01libceph: make method call data be a separate data itemAlex Elder
Right now the data for a method call is specified via a pointer and length, and it's copied--along with the class and method name--into a pagelist data item to be sent to the osd. Instead, encode the data in a data item separate from the class and method names. This will allow large amounts of data to be supplied to methods without copying. Only rbd uses the class functionality right now, and when it really needs this it will probably need to use a page array rather than a page list. But this simple implementation demonstrates the functionality on the osd client, and that's enough for now. This resolves: http://tracker.ceph.com/issues/4104 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: add, don't set data for a messageAlex Elder
Change the names of the functions that put data on a pagelist to reflect that we're adding to whatever's already there rather than just setting it to the one thing. Currently only one data item is ever added to a message, but that's about to change. This resolves: http://tracker.ceph.com/issues/2770 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: implement multiple data items in a messageAlex Elder
This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: replace message data pointer with listAlex Elder
In place of the message data pointer, use a list head which links through message data items. For now we only support a single entry on that list. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: have cursor point to dataAlex Elder
Rather than having a ceph message data item point to the cursor it's associated with, have the cursor point to a data item. This will allow a message cursor to be used for more than one data item. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: move cursor into messageAlex Elder
A message will only be processing a single data item at a time, so there's no need for each data item to have its own cursor. Move the cursor embedded in the message data structure into the message itself. To minimize the impact, keep the data->cursor field, but make it be a pointer to the cursor in the message. Move the definition of ceph_msg_data above ceph_msg_data_cursor so the cursor can point to the data without a forward definition rather than vice-versa. This and the upcoming patches are part of: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: record bio lengthAlex Elder
The bio is the only data item type that doesn't record its full length. Fix that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: skip message if too big to receiveAlex Elder
We know the length of our message buffers. If we get a message that's too long, just dump it and ignore it. If skip was set then con->in_msg won't be valid, so be careful not to dereference a null pointer in the process. This resolves: http://tracker.ceph.com/issues/4664 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: fix possible CONFIG_BLOCK build problemAlex Elder
This patch: 15a0d7b libceph: record message data length did not enclose some bio-specific code inside CONFIG_BLOCK as it should have. Fix that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: kill off osd request r_data_in and r_data_outAlex Elder
Finally! Convert the osd op data pointers into real structures, and make the switch over to using them instead of having all ops share the in and/or out data structures in the osd request. Set up a new function to traverse the set of ops and release any data associated with them (pages). This and the patches leading up to it resolve: http://tracker.ceph.com/issues/4657 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: set the data pointers when encoding opsAlex Elder
Still using the osd request r_data_in and r_data_out pointer, but we're basically only referring to it via the data pointers in the osd ops. And we're transferring that information to the request or reply message only when the op indicates it's needed, in osd_req_encode_op(). To avoid a forward reference, ceph_osdc_msg_data_set() was moved up in the file. Don't bother calling ceph_osd_data_init(), in ceph_osd_alloc(), because the ops array will already be zeroed anyway. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>