aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2014-01-09arm64: fix possible invalid FPSIMD initialization stateJiang Liu
commit 6db83cea1c975b9a102e17def7d2795814e1ae2b upstream. If context switching happens during executing fpsimd_flush_thread(), stale value in FPSIMD registers will be saved into current thread's fpsimd_state by fpsimd_thread_switch(). That may cause invalid initialization state for the new process, so disable preemption when executing fpsimd_flush_thread(). Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Jiang Liu <liuj97@gmail.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09arm64: Change kernel stack size to 16KFeng Kan
commit 845ad05ec31e0f3872a321e10dbeaf872022632c upstream. Written by Catalin Marinas, tested by APM on storm platform. This is needed because of the failures encountered when running SpecWeb benchmark test. Signed-off-by: Feng Kan <fkan@apm.com> Acked-by: Kumar Sankaran <ksankaran@apm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09arm64: virt: ensure visibility of __boot_cpu_modeMark Rutland
commit 82b2f495fba338d1e3098dde1df54944a9c19751 upstream. Secondary CPUs write to __boot_cpu_mode with caches disabled, and thus a cached value of __boot_cpu_mode may be incoherent with that in memory. This could lead to a failure to detect mismatched boot modes. This patch adds flushing to ensure that writes by secondaries to __boot_cpu_mode are made visible before we test against it. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@cs.columbia.edu> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09arm64: Only enable local interrupts after the CPU is marked onlineCatalin Marinas
commit 53ae3acd4390ffeecb3a11dbd5be347b5a3d98f2 upstream. There is a slight chance that (timer) interrupts are triggered before a secondary CPU has been marked online with implications on softirq thread affinity. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Kirill Tkhai <tkhai@yandex.ru> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: fix error handling from rbd_snap_name()Josh Durgin
commit da6a6b63978d45f9ae582d1f362f182012da3a22 upstream. rbd_snap_name() calls rbd_dev_v{1,2}_snap_name() depending on the format of the image. The format 1 version returns NULL on error, which is handled by the caller. The format 2 version returns an ERR_PTR, which the caller of rbd_snap_name() does not expect. Fortunately this is unlikely to occur in practice because rbd_snap_id_by_name() is called before rbd_snap_name(). This would hit similar errors to rbd_snap_name() (like the snapshot not existing) and return early, so rbd_snap_name() would not hit an error unless the snapshot was removed between the two calls or memory was exhausted. Use an ERR_PTR in rbd_dev_v1_snap_name() so that the specific error can be propagated, and it is consistent with rbd_dev_v2_snap_name(). Handle the ERR_PTR in the only rbd_snap_name() caller. Suggested-by: Alex Elder <alex.elder@linaro.org> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: ignore unmapped snapshots that no longer existJosh Durgin
commit efadc98aab674153709cc357ba565f04e3164fcd upstream. This prevents erroring out while adding a device when a snapshot unrelated to the current mapping is deleted between reading the snapshot context and reading the snapshot names. If the mapped snapshot name is not found an error still occurs as usual. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: fix use-after free of rbd_dev->diskJosh Durgin
commit 9875201e10496612080e7d164acc8f625c18725c upstream. Removing a device deallocates the disk, unschedules the watch, and finally cleans up the rbd_dev structure. rbd_dev_refresh(), called from the watch callback, updates the disk size and rbd_dev structure. With no locking between them, rbd_dev_refresh() may use the device or rbd_dev after they've been freed. To fix this, check whether RBD_DEV_FLAG_REMOVING is set before updating the disk size in rbd_dev_refresh(). In order to prevent a race where rbd_dev_refresh() is already revalidating the disk when rbd_remove() is called, move the call to rbd_bus_del_dev() after the watch is unregistered and all notifies are complete. It's safe to defer deleting this structure because no new requests can be submitted once the RBD_DEV_FLAG_REMOVING is set, since the device cannot be opened. Fixes: http://tracker.ceph.com/issues/5636 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: make rbd_obj_notify_ack() synchronousJosh Durgin
commit 20e0af67ce88c657d0601977b9941a2256afbdaa upstream. The only user of rbd_obj_notify_ack() is rbd_watch_cb(). It used asynchronously with no tracking of when the notify ack completes, so it may still be in progress when the osd_client is shut down. This results in a BUG() since the osd client assumes no requests are in flight when it stops. Since all notifies are flushed before the osd_client is stopped, waiting for the notify ack to complete before returning from the watch callback ensures there are no notify acks in flight during shutdown. Rename rbd_obj_notify_ack() to rbd_obj_notify_ack_sync() to reflect its new synchronous nature. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: complete notifies before cleaning up osd_client and rbd_devJosh Durgin
commit 9abc59908e0c5f983aaa91150da32d5b62cf60b7 upstream. To ensure rbd_dev is not used after it's released, flush all pending notify callbacks before calling rbd_dev_image_release(). No new notifies can be added to the queue at this point because the watch has already be unregistered with the osd_client. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: add function to ensure notifies are completeJosh Durgin
commit dd935f44a40f8fb02aff2cc0df2269c92422df1c upstream. Without a way to flush the osd client's notify workqueue, a watch event that is unregistered could continue receiving callbacks indefinitely. Unregistering the event simply means no new notifies are added to the queue, but there may still be events in the queue that will call the watch callback for the event. If the queue is flushed after the event is unregistered, the caller can be sure no more watch callbacks will occur for the canceled watch. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: fix null dereference in doutJosh Durgin
commit c35455791c1131e7ccbf56ea6fbdd562401c2ce2 upstream. The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but the dout() always derefences it. Move this to another dout() protected by a check that order is non-NULL. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <alex.elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: fix buffer size for writes to images with snapshotsJosh Durgin
commit 03507db631c94a48e316c7f638ffb2991544d617 upstream. rbd_osd_req_create() needs to know the snapshot context size to create a buffer large enough to send it with the message front. It gets this from the img_request, which was not set for the obj_request yet. This resulted in trying to write past the end of the front payload, hitting this BUG: libceph: BUG_ON(p > msg->front.iov_base + msg->front.iov_len); Fix this by associating the obj_request with its img_request immediately after it's created, before the osd request is created. Fixes: http://tracker.ceph.com/issues/5760 Suggested-by: Alex Elder <alex.elder@linaro.org> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <alex.elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: allow sync_read/write return partial successed size of read/write.majianpeng
commit ee7289bfadda5f4ef60884547ebc9989c8fb314a upstream. For sync_read/write, it may do multi stripe operations.If one of those met erro, we return the former successed size rather than a error value. There is a exception for write-operation met -EOLDSNAPC.If this occur,we retry the whole write again. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: fix bugs about handling short-read for sync read mode.majianpeng
commit 02ae66d8b229708fd94b764f6c17ead1c7741fcf upstream. cephfs . show_layout >layyout.data_pool: 0 >layout.object_size: 4194304 >layout.stripe_unit: 4194304 >layout.stripe_count: 1 TestA: >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct >dd if=/dev/urandom of=test bs=1M count=2 seek=4 oflag=direct >dd if=test of=/dev/null bs=6M count=1 iflag=direct The messages from func striped_read are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT ceph: file.c:350 : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT ceph: file.c:381 : zero tail 4194304 ceph: file.c:390 : striped_read returns 6291456 The hole of file is from 2M--4M.But actualy it zero the last 4M include the last 2M area which isn't a hole. Using this patch, the messages are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT ceph: file.c:358 : zero gap 2097152 to 4194304 ceph: file.c:350 : striped_read 4194304~2097152 (read 4194304) got 2097152 ceph: file.c:384 : striped_read returns 6291456 TestB: >echo majianpeng > test >dd if=test of=/dev/null bs=2M count=1 iflag=direct The messages are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT ceph: file.c:350 : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT ceph: file.c:390 : striped_read returns 11 For this case,it did once more striped_read.It's no meaningless. Using this patch, the message are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT ceph: file.c:384 : striped_read returns 11 Big thanks to Yan Zheng for the patch. Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: create_singlethread_workqueue() doesn't return ERR_PTRsDan Carpenter
commit dbcae088fa660086bde6e10d63bb3c9264832d85 upstream. create_singlethread_workqueue() returns NULL on error, and it doesn't return ERR_PTRs. I tweaked the error handling a little to be consistent with earlier in the function. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: potential NULL dereference in ceph_osdc_handle_map()Dan Carpenter
commit b72e19b9225d4297a18715b0998093d843d170fa upstream. There are two places where we read "nr_maps" if both of them are set to zero then we would hit a NULL dereference here. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: fix error handling in handle_reply()Dan Carpenter
commit 1874119664dafda3ef2ed9b51b4759a9540d4a1a upstream. We've tried to fix the error paths in this function before, but there is still a hidden goto in the ceph_decode_need() macro which goes to the wrong place. We need to release the "req" and unlock a mutex before returning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: Add check returned value on func ceph_calc_ceph_pg.majianpeng
commit 2fbcbff1d6b9243ef71c64a8ab993bc3c7bb7af1 upstream. Func ceph_calc_ceph_pg maybe failed.So add check for returned value. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: cleanup types in striped_read()Dan Carpenter
commit 688bac461ba3e9d221a879ab40b687f5d7b5b19c upstream. We pass in a u64 value for "len" and then immediately truncate away the upper 32 bits. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <alex.elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: fix null pointer dereferenceNathaniel Yazdani
commit c338c07c51e3106711fad5eb599e375eadb6855d upstream. When register_session() is given an out-of-range argument for mds, ceph_mdsmap_get_addr() will return a null pointer, which would be given to ceph_con_open() & be dereferenced, causing a kernel oops. This fixes bug #4685 in the Ceph bug tracker <http://tracker.ceph.com/issues/4685>. Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: call r_unsafe_callback when unsafe reply is receivedYan, Zheng
commit 61c5d6bf7074ee32d014dcdf7698dc8c59eb712d upstream. We can't use !req->r_sent to check if OSD request is sent for the first time, this is because __cancel_request() zeros req->r_sent when OSD map changes. Rather than adding a new variable to struct ceph_osd_request to indicate if it's sent for the first time, We can call the unsafe callback only when unsafe OSD reply is received. If OSD's first reply is safe, just skip calling the unsafe callback. The purpose of unsafe callback is adding unsafe request to a list, so that fsync(2) can wait for the safe reply. fsync(2) doesn't need to wait for a write(2) that hasn't returned yet. So it's OK to add request to the unsafe list when the first OSD reply is received. (ceph_sync_write() returns after receiving the first OSD reply) Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: avoid accessing invalid memorySasha Levin
commit 5446429630257f4723829409337a26c076907d5d upstream. when mounting ceph with a dev name that starts with a slash, ceph would attempt to access the character before that slash. Since we don't actually own that byte of memory, we would trigger an invalid access: [ 43.499934] BUG: unable to handle kernel paging request at ffff880fa3a97fff [ 43.500984] IP: [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300 [ 43.501491] PGD 743b067 PUD 10283c4067 PMD 10282a6067 PTE 8000000fa3a97060 [ 43.502301] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 43.503006] Dumping ftrace buffer: [ 43.503596] (ftrace buffer empty) [ 43.504046] CPU: 0 PID: 10879 Comm: mount Tainted: G W 3.10.0-sasha #1129 [ 43.504851] task: ffff880fa625b000 ti: ffff880fa3412000 task.ti: ffff880fa3412000 [ 43.505608] RIP: 0010:[<ffffffff818f3884>] [<ffffffff818f3884>] parse_mount_options$ [ 43.506552] RSP: 0018:ffff880fa3413d08 EFLAGS: 00010286 [ 43.507133] RAX: ffff880fa3a98000 RBX: ffff880fa3a98000 RCX: 0000000000000000 [ 43.507893] RDX: ffff880fa3a98001 RSI: 000000000000002f RDI: ffff880fa3a98000 [ 43.508610] RBP: ffff880fa3413d58 R08: 0000000000001f99 R09: ffff880fa3fe64c0 [ 43.509426] R10: ffff880fa3413d98 R11: ffff880fa38710d8 R12: ffff880fa3413da0 [ 43.509792] R13: ffff880fa3a97fff R14: 0000000000000000 R15: ffff880fa3413d90 [ 43.509792] FS: 00007fa9c48757e0(0000) GS:ffff880fd2600000(0000) knlGS:000000000000$ [ 43.509792] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 43.509792] CR2: ffff880fa3a97fff CR3: 0000000fa3bb9000 CR4: 00000000000006b0 [ 43.509792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 43.509792] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 43.509792] Stack: [ 43.509792] 0000e5180000000e ffffffff85ca1900 ffff880fa38710d8 ffff880fa3413d98 [ 43.509792] 0000000000000120 0000000000000000 ffff880fa3a98000 0000000000000000 [ 43.509792] ffffffff85cf32a0 0000000000000000 ffff880fa3413dc8 ffffffff818f3c72 [ 43.509792] Call Trace: [ 43.509792] [<ffffffff818f3c72>] ceph_mount+0xa2/0x390 [ 43.509792] [<ffffffff81226314>] ? pcpu_alloc+0x334/0x3c0 [ 43.509792] [<ffffffff81282f8d>] mount_fs+0x8d/0x1a0 [ 43.509792] [<ffffffff812263d0>] ? __alloc_percpu+0x10/0x20 [ 43.509792] [<ffffffff8129f799>] vfs_kern_mount+0x79/0x100 [ 43.509792] [<ffffffff812a224d>] do_new_mount+0xcd/0x1c0 [ 43.509792] [<ffffffff812a2e8d>] do_mount+0x15d/0x210 [ 43.509792] [<ffffffff81220e55>] ? strndup_user+0x45/0x60 [ 43.509792] [<ffffffff812a2fdd>] SyS_mount+0x9d/0xe0 [ 43.509792] [<ffffffff83fd816c>] tracesys+0xdd/0xe2 [ 43.509792] Code: 4c 8b 5d c0 74 0a 48 8d 50 01 49 89 14 24 eb 17 31 c0 48 83 c9 ff $ [ 43.509792] RIP [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300 [ 43.509792] RSP <ffff880fa3413d08> [ 43.509792] CR2: ffff880fa3a97fff [ 43.509792] ---[ end trace 22469cd81e93af51 ]--- Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: Sage Weil <sage@inktan.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: Free mdsc if alloc mdsc->mdsmap failed.majianpeng
commit fb3101b6f0db9ae3f35dc8e6ec908d0af8cdf12e upstream. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: fix a couple warningsSage Weil
commit e976cad0f0dbe5440a4ca38e29e1f932d9319125 upstream. gcc isn't quite smart enough and generates these warnings: drivers/block/rbd.c: In function 'rbd_img_request_fill': drivers/block/rbd.c:1266:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized] drivers/block/rbd.c:2186:14: note: 'bio_list' was declared here drivers/block/rbd.c:2247:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized] even though they are initialized for their respective code paths. Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: fix truncate size calculationYan, Zheng
commit ccca4e37b1a912da3db68aee826557ea66145273 upstream. check the "not truncated yet" case Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: fix safe completionYan, Zheng
commit eb845ff13a44477f8a411baedbf11d678b9daf0a upstream. handle_reply() calls complete_request() only if the first OSD reply has ONDISK flag. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: protect against concurrent unmapsAlex Elder
commit 82a442d239695a242c4d584464c9606322cd02aa upstream. Make sure two concurrent unmap operations on the same rbd device won't collide, by only proceeding with the removal and cleanup of a device if is not already underway. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: set removing flag while holding list lockAlex Elder
commit 751cc0e3cfabdda87c4c21519253c6751e97a8d4 upstream. When unmapping a device, its id is supplied, and that is used to look up which rbd device should be unmapped. Looking up the device involves searching the rbd device list while holding a spinlock that protects access to that list. Currently all of this is done under protection of the control lock, but that protection is going away soon. To ensure the rbd_dev is still valid (still on the list) while setting its REMOVING flag, do so while still holding the list lock. To do so, get rid of __rbd_get_dev(), and open code what it did in the one place it was used. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09rbd: flush dcache after zeroing page dataAlex Elder
commit e215605417b87732c6debf65da6d953016a1e5bc upstream. Neither zero_bio_chain() nor zero_pages() contains a call to flush caches after zeroing a portion of a page. This can cause problems on architectures that have caches that allow virtual address aliasing. This resolves: http://tracker.ceph.com/issues/4777 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09libceph: add lingering request reference when registeredAlex Elder
commit 96e4dac66f69d28af2b736e723364efbbdf9fdee upstream. When an osd request is set to linger, the osd client holds onto the request so it can be re-submitted following certain osd map changes. The osd client holds a reference to the request until it is unregistered. This is used by rbd for watch requests. Currently, the reference is taken when the request is marked with the linger flag. This means that if an error occurs after that time but before the the request completes successfully, that reference is leaked. There's really no reason to take the reference until the request is registered in the the osd client's list of lingering requests, and that only happens when the lingering (watch) request completes successfully. So take that reference only when it gets registered following succesful completion, and drop it (as before) when the request gets unregistered. This avoids the reference problem on error in rbd. Rearrange ceph_osdc_unregister_linger_request() to avoid using the request pointer after it may have been freed. And hold an extra reference in kick_requests() while handling a linger request that has not yet been registered, to ensure it doesn't go away. This resolves: http://tracker.ceph.com/issues/3859 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: improve error handling in ceph_mdsmap_decodeEmil Goode
commit c213b50b7dcbf06abcfbf1e4eee5b76586718bd9 upstream. This patch makes the following improvements to the error handling in the ceph_mdsmap_decode function: - Add a NULL check for return value from kcalloc - Make use of the variable err Signed-off-by: Emil Goode <emilgoode@gmail.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09clocksource: dw_apb_timer_of: Fix read_sched_clockDinh Nguyen
commit 85dc6ee1237c8a4a7742e6abab96a20389b7d682 upstream. The read_sched_clock should return the ~value because the clock is a countdown implementation. read_sched_clock() should be the same as __apbt_read_clocksource(). Signed-off-by: Dinh Nguyen <dinguyen@altera.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09selinux: process labeled IPsec TCP SYN-ACK packets properly in ↵Paul Moore
selinux_ip_postroute() commit c0828e50485932b7e019df377a6b0a8d1ebd3080 upstream. Due to difficulty in arriving at the proper security label for TCP SYN-ACK packets in selinux_ip_postroute(), we need to check packets while/before they are undergoing XFRM transforms instead of waiting until afterwards so that we can determine the correct security label. Reported-by: Janak Desai <Janak.Desai@gtri.gatech.edu> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09selinux: look for IPsec labels on both inbound and outbound packetsPaul Moore
commit 817eff718dca4e54d5721211ddde0914428fbb7c upstream. Previously selinux_skb_peerlbl_sid() would only check for labeled IPsec security labels on inbound packets, this patch enables it to check both inbound and outbound traffic for labeled IPsec security labels. Reported-by: Janak Desai <Janak.Desai@gtri.gatech.edu> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09sh: always link in helper functions extracted from libgccGeert Uytterhoeven
commit 84ed8a99058e61567f495cc43118344261641c5f upstream. E.g. landisk_defconfig, which has CONFIG_NTFS_FS=m: ERROR: "__ashrdi3" [fs/ntfs/ntfs.ko] undefined! For "lib-y", if no symbols in a compilation unit are referenced by other units, the compilation unit will not be included in vmlinux. This breaks modules that do reference those symbols. Use "obj-y" instead to fix this. http://kisskb.ellerman.id.au/kisskb/buildresult/8838077/ This doesn't fix all cases. There are others, e.g. udivsi3. This is also not limited to sh, many architectures handle this in the same way. A simple solution is to unconditionally include all helper functions. A more complex solution is to make the choice of "lib-y" or "obj-y" depend on CONFIG_MODULES: obj-$(CONFIG_MODULES) += ... lib-y($CONFIG_MODULES) += ... Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Paul Mundt <lethal@linux-sh.org> Tested-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com> Reviewed-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09gpio: msm: Fix irq mask/unmask by writing bits instead of numbersStephen Boyd
commit 4cc629b7a20945ce35628179180329b6bc9e552b upstream. We should be writing bits here but instead we're writing the numbers that correspond to the bits we want to write. Fix it by wrapping the numbers in the BIT() macro. This fixes gpios acting as interrupts. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09gpio: twl4030: Fix regression for twl gpio LED outputRoger Quadros
commit f5837ec11f8cfa6d53ebc5806582771b2c9988c6 upstream. Commit 0b2aa8be introduced a regression that causes failure in setting LED GPO direction to OUT. This causes USB host probe failures for Beagleboard C4. platform usb_phy_gen_xceiv.2: Driver usb_phy_gen_xceiv requests probe deferral hsusb2_vcc: Failed to request enable GPIO510: -22 reg-fixed-voltage reg-fixed-voltage.0.auto: Failed to register regulator: -22 reg-fixed-voltage: probe of reg-fixed-voltage.0.auto failed with error -22 direction_out/direction_in must return 0 if the operation succeeded. Also, don't update direction flag and output data if twl4030_set_gpio_direction() failed inside twl_direction_out(); Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09jbd2: don't BUG but return ENOSPC if a handle runs out of spaceTheodore Ts'o
commit f6c07cad081ba222d63623d913aafba5586c1d2c upstream. If a handle runs out of space, we currently stop the kernel with a BUG in jbd2_journal_dirty_metadata(). This makes it hard to figure out what might be going on. So return an error of ENOSPC, so we can let the file system layer figure out what is going on, to make it more likely we can get useful debugging information). This should make it easier to debug problems such as the one which was reported by: https://bugzilla.kernel.org/show_bug.cgi?id=44731 The only two callers of this function are ext4_handle_dirty_metadata() and ocfs2_journal_dirty(). The ocfs2 function will trigger a BUG_ON(), which means there will be no change in behavior. The ext4 function will call ext4_error_inode() which will print the useful debugging information and then handle the situation using ext4's error handling mechanisms (i.e., which might mean halting the kernel or remounting the file system read-only). Also, since both file systems already call WARN_ON(), drop the WARN_ON from jbd2_journal_dirty_metadata() to avoid two stack traces from being displayed. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ocfs2-devel@oss.oracle.com Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09s390/3270: fix allocation of tty3270_screen structureMartin Schwidefsky
commit 36d9f4d3b68c7035ead3850dc85f310a579ed0eb upstream. The tty3270_alloc_screen function is called from tty3270_install with swapped arguments, the number of columns instead of rows and vice versa. The number of rows is typically smaller than the number of columns which makes the screen array too big but the individual cell arrays for the lines too small. Creating lines longer than the number of rows will clobber the memory after the end of the cell array. The fix is simple, call tty3270_alloc_screen with the correct argument order. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09memcg: fix memcg_size() calculationVladimir Davydov
commit 695c60830764945cf61a2cc623eb1392d137223e upstream. The mem_cgroup structure contains nr_node_ids pointers to mem_cgroup_per_node objects, not the objects themselves. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09GFS2: Fix incorrect invalidation for DIO/buffered I/OSteven Whitehouse
commit dfd11184d894cd0a92397b25cac18831a1a6a5bc upstream. In patch 209806aba9d540dde3db0a5ce72307f85f33468f we allowed local deferred locks to be granted against a cached exclusive lock. That opened up a corner case which this patch now fixes. The solution to the problem is to check whether we have cached pages each time we do direct I/O and if so to unmap, flush and invalidate those pages. Since the glock state machine normally does that for us, mostly the code will be a no-op. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09GFS2: don't hold s_umount over blkdev_putSteven Whitehouse
commit dfe5b9ad83a63180f358b27d1018649a27b394a9 upstream. This is a GFS2 version of Tejun's patch: 4f331f01b9c43bf001d3ffee578a97a1e0633eac vfs: don't hold s_umount over close_bdev_exclusive() call In this case its blkdev_put itself that is the issue and this patch uses the same solution of dropping and retaking s_umount. Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09Input: allocate absinfo data when setting ABS capabilityDmitry Torokhov
commit 28a2a2e1aedbe2d8b2301e6e0e4e63f6e4177aca upstream. We need to make sure we allocate absinfo data when we are setting one of EV_ABS/ABS_XXX capabilities, otherwise we may bomb when we try to emit this event. Rested-by: Paul Cercueil <pcercuei@gmail.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/memory-failure.c: transfer page count from head page to tail page after ↵Naoya Horiguchi
split thp commit a3e0f9e47d5ef7858a26cc12d90ad5146e802d47 upstream. Memory failures on thp tail pages cause kernel panic like below: mce: [Hardware Error]: Machine check events logged MCE exception done on CPU 7 BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0 PGD bae42067 PUD ba47d067 PMD 0 Oops: 0000 [#1] SMP ... CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G M O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25 ... Call Trace: me_huge_page+0x3e/0x50 memory_failure+0x4bb/0xc20 mce_process_work+0x3e/0x70 process_one_work+0x171/0x420 worker_thread+0x11b/0x3a0 ? manage_workers.isra.25+0x2b0/0x2b0 kthread+0xe4/0x100 ? kthread_create_on_node+0x190/0x190 ret_from_fork+0x7c/0xb0 ? kthread_create_on_node+0x190/0x190 ... RIP dequeue_hwpoisoned_huge_page+0x131/0x1e0 CR2: 0000000000000058 The reasoning of this problem is shown below: - when we have a memory error on a thp tail page, the memory error handler grabs a refcount of the head page to keep the thp under us. - Before unmapping the error page from processes, we split the thp, where page refcounts of both of head/tail pages don't change. - Then we call try_to_unmap() over the error page (which was a tail page before). We didn't pin the error page to handle the memory error, this error page is freed and removed from LRU list. - We never have the error page on LRU list, so the first page state check returns "unknown page," then we move to the second check with the saved page flag. - The saved page flag have PG_tail set, so the second page state check returns "hugepage." - We call me_huge_page() for freed error page, then we hit the above panic. The root cause is that we didn't move refcount from the head page to the tail page after split thp. So this patch suggests to do this. This panic was introduced by commit 524fca1e73 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages"). Note that we did have the same refcount problem before this commit, but it was just ignored because we had only first page state check which returned "unknown page." The commit changed the refcount problem from "doesn't work" to "kernel panic." Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: fix use-after-free in sys_remap_file_pagesRik van Riel
commit 4eb919825e6c3c7fb3630d5621f6d11e98a18b3a upstream. remap_file_pages calls mmap_region, which may merge the VMA with other existing VMAs, and free "vma". This can lead to a use-after-free bug. Avoid the bug by remembering vm_flags before calling mmap_region, and not trying to dereference vma later. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Kees Cook <keescook@chromium.org> Cc: Michel Lespinasse <walken@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/hugetlb: check for pte NULL pointer in __page_check_address()Jianguo Wu
commit 98398c32f6687ee1e1f3ae084effb4b75adb0747 upstream. In __page_check_address(), if address's pud is not present, huge_pte_offset() will return NULL, we should check the return value. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: qiuxishi <qiuxishi@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/compaction: respect ignore_skip_hint in update_pageblock_skipJoonsoo Kim
commit 6815bf3f233e0b10c99a758497d5d236063b010b upstream. update_pageblock_skip() only fits to compaction which tries to isolate by pageblock unit. If isolate_migratepages_range() is called by CMA, it try to isolate regardless of pageblock unit and it don't reference get_pageblock_skip() by ignore_skip_hint. We should also respect it on update_pageblock_skip() to prevent from setting the wrong information. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: numa: guarantee that tlb_flush_pending updates are visible before page ↵Mel Gorman
table updates commit af2c1401e6f9177483be4fad876d0073669df9df upstream. According to documentation on barriers, stores issued before a LOCK can complete after the lock implying that it's possible tlb_flush_pending can be visible after a page table update. As per revised documentation, this patch adds a smp_mb__before_spinlock to guarantee the correct ordering. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: fix TLB flush race between migration, and change_protection_rangeRik van Riel
commit 20841405940e7be0617612d521e206e4b6b325db upstream. There are a few subtle races, between change_protection_range (used by mprotect and change_prot_numa) on one side, and NUMA page migration and compaction on the other side. The basic race is that there is a time window between when the PTE gets made non-present (PROT_NONE or NUMA), and the TLB is flushed. During that time, a CPU may continue writing to the page. This is fine most of the time, however compaction or the NUMA migration code may come in, and migrate the page away. When that happens, the CPU may continue writing, through the cached translation, to what is no longer the current memory location of the process. This only affects x86, which has a somewhat optimistic pte_accessible. All other architectures appear to be safe, and will either always flush, or flush whenever there is a valid mapping, even with no permissions (SPARC). The basic race looks like this: CPU A CPU B CPU C load TLB entry make entry PTE/PMD_NUMA fault on entry read/write old page start migrating page change PTE/PMD to new page read/write old page [*] flush TLB reload TLB from new entry read/write new page lose data [*] the old page may belong to a new user at this point! The obvious fix is to flush remote TLB entries, by making sure that pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may still be accessible if there is a TLB flush pending for the mm. This should fix both NUMA migration and compaction. [mgorman@suse.de: fix build] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09sched: fix the theoretical signal_wake_up() vs schedule() raceOleg Nesterov
commit e0acd0a68ec7dbf6b7a81a87a867ebd7ac9b76c4 upstream. This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if(signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>