aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2014-06-11tty/serial: Add support for Altera serial portLey Foon Tan
commit e06c93cacb82dd147266fd1bdb2d0a0bd45ff2c1 upstream. Add support for Altera 8250/16550 compatible serial port. Signed-off-by: Ley Foon Tan <lftan@altera.com> [xr: Backported to 3.4: adjust filenames, context] Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-118250/16?50: Add support for Broadcom TruManage redirected serial portStephen Hurd
commit ebebd49a8eab5e9aa1b1f8f1614ccc3c2120f886 upstream. Add support for the UART device present in Broadcom TruManage capable NetXtreme chips (ie: 5761m 5762, and 5725). This implementation has a hidden transmit FIFO, so running in single-byte interrupt mode results in too many interrupts. The UART_CAP_HFIFO capability was added to track this. It continues to reload the THR as long as the THRE and TSRE bits are set in the LSR up to a specified limit (1024 is used here). Signed-off-by: Stephen Hurd <shurd@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> [xr: Backported to 3.4: - Adjust filenames - Adjust context - PORT_BRCM_TRUMANAGE is 22 not 24] Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11hpsa: gen8plus Smart Array IDsMike Miller
commit fe0c9610bb68dd0aad1017456f5e3c31264d70c2 upstream. Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11nfsd: check passed socket's net matches NFSd superblock's oneStanislav Kinsbursky
commit 3064639423c48d6e0eb9ecc27c512a58e38c6c57 upstream. There could be a case, when NFSd file system is mounted in network, different to socket's one, like below: "ip netns exec" creates new network and mount namespace, which duplicates NFSd mount point, created in init_net context. And thus NFS server stop in nested network context leads to RPCBIND client destruction in init_net. Then, on NFSd start in nested network context, rpc.nfsd process creates socket in nested net and passes it into "write_ports", which leads to RPCBIND sockets creation in init_net context because of the same reason (NFSd monut point was created in init_net context). An attempt to register passed socket in nested net leads to panic, because no RPCBIND client present in nexted network namespace. This patch add check that passed socket's net matches NFSd superblock's one. And returns -EINVAL error to user psace otherwise. v2: Put socket on exit. Reported-by: Weng Meiling <wengmeiling.weng@huawei.com> Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> [wengmeiling: backport to 3.4: adjust context] Signed-off-by: Weng Meiling <wengmeiling.weng@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11virtio_console: fix uapi headerMichael S. Tsirkin
commit 6407d75afd08545f2252bb39806ffd3f10c7faac upstream. uapi should use __u32 not u32. Fix a macro in virtio_console.h which uses u32. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11mm: add kmap_to_page()Ben Hutchings
commit fcb8996728fb59eddf84678df7cb213b2c9a2e26 upstream. This is extracted from Mel Gorman's commit 5a178119b0fb ('mm: add support for direct_IO to highmem pages') upstream. Required to backport commit b9cdc88df8e6 ('virtio: 9p: correctly pass physical address to userspace for high pages'). Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07xen-netfront: reduce gso_max_size to account for max TCP headerWei Liu
commit 9ecd1a75d977e2e8c48139c7d3efed183f898d94 upstream. The maximum packet including header that can be handled by netfront / netback wire format is 65535. Reduce gso_max_size accordingly. Drop skb and print warning when skb->len > 65535. This can 1) save the effort to send malformed packet to netback, 2) help spotting misconfiguration of netfront in the future. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [hq: Backported to 3.4: adjust context] Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07net: Add net_ratelimited_function and net_<level>_ratelimited macrosJoe Perches
commit 3a3bfb61e64476ff1e4ac3122cb6dec9c79b795c upstream. __ratelimit() can be considered an inverted bool test because it returns true when not ratelimited. Several tests in the kernel tree use this __ratelimit() function incorrectly. No net_ratelimit uses are incorrect currently though. Most uses of net_ratelimit are to log something via printk or pr_<level>. In order to minimize the uses of net_ratelimit, and to start standardizing the code style used for __ratelimit() and net_ratelimit(), add a net_ratelimited_function() macro and net_<level>_ratelimited() logging macros similar to pr_<level>_ratelimited that use the global net_ratelimit instead of a static per call site "struct ratelimit_state". Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Qiang Huang <h.huangqiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07drm: Pad drm_mode_get_connector to 64-bit boundaryChris Wilson
commit bc5bd37ce48c66e9192ad2e7231e9678880f6f8e upstream. Pavel Roskin reported that DRM_IOCTL_MODE_GETCONNECTOR was overwritting the 4 bytes beyond the end of its structure with a 32-bit userspace running on a 64-bit kernel. This is due to the padding gcc inserts as the drm_mode_get_connector struct includes a u64 and its size is not a natural multiple of u64s. 64-bit kernel: sizeof(drm_mode_get_connector)=80, alignof=8 sizeof(drm_mode_get_encoder)=20, alignof=4 sizeof(drm_mode_modeinfo)=68, alignof=4 32-bit userspace: sizeof(drm_mode_get_connector)=76, alignof=4 sizeof(drm_mode_get_encoder)=20, alignof=4 sizeof(drm_mode_modeinfo)=68, alignof=4 Fortuituously we can insert explicit padding to the tail of our structures without breaking ABI. Reported-by: Pavel Roskin <proski@gnu.org> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Dave Airlie <airlied@redhat.com> Cc: dri-devel@lists.freedesktop.org Signed-off-by: Dave Airlie <airlied@redhat.com> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Weng Meiling <wengmeiling.weng@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07x86, efivars: firmware bug workarounds should be in platform codeMatt Fleming
commit a6e4d5a03e9e3587e88aba687d8f225f4f04c792 upstream. Let's not burden ia64 with checks in the common efivars code that we're not writing too much data to the variable store. That kind of thing is an x86 firmware bug, plain and simple. efi_query_variable_store() provides platforms with a wrapper in which they can perform checks and workarounds for EFI variable storage bugs. Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> [xr: Backported to 3.4: adjust context] Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07efi_pstore: Introducing workqueue updating sysfsSeiji Aguchi
commit a93bc0c6e07ed9bac44700280e65e2945d864fd4 upstream. [Problem] efi_pstore creates sysfs entries, which enable users to access to NVRAM, in a write callback. If a kernel panic happens in an interrupt context, it may fail because it could sleep due to dynamic memory allocations during creating sysfs entries. [Patch Description] This patch removes sysfs operations from a write callback by introducing a workqueue updating sysfs entries which is scheduled after the write callback is called. Also, the workqueue is kicked in a just oops case. A system will go down in other cases such as panic, clean shutdown and emergency restart. And we don't need to create sysfs entries because there is no chance for users to access to them. efi_pstore will be robust against a kernel panic in an interrupt context with this patch. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> [xr: Backported to 3.4: - Adjust contest - Remove repeated definition of helper function variable_is_present] Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07efi: be more paranoid about available space when creating variablesJosh Boyer
commit 68d929862e29a8b52a7f2f2f86a0600423b093cd upstream. UEFI variables are typically stored in flash. For various reasons, avaiable space is typically not reclaimed immediately upon the deletion of a variable - instead, the system will garbage collect during initialisation after a reboot. Some systems appear to handle this garbage collection extremely poorly, failing if more than 50% of the system flash is in use. This can result in the machine refusing to boot. The safest thing to do for the moment is to forbid writes if they'd end up using more than half of the storage space. We can make this more finegrained later if we come up with a method for identifying the broken machines. Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> [bwh: Backported to 3.2: - Drop efivarfs changes and unused check_var_size() - Add error codes to include/linux/efi.h, added upstream by commit 5d9db883761a ('efi: Add support for a UEFI variable filesystem') - Add efi_status_to_err(), added upstream by commit 7253eaba7b17 ('efivarfs: Return an error if we fail to read a variable')] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07efi_pstore: Check remaining space with QueryVariableInfo() before writing dataSeiji Aguchi
commit d80a361d779a9f19498943d1ca84243209cd5647 upstream. [Issue] As discussed in a thread below, Running out of space in EFI isn't a well-tested scenario. And we wouldn't expect all firmware to handle it gracefully. http://marc.info/?l=linux-kernel&m=134305325801789&w=2 On the other hand, current efi_pstore doesn't check a remaining space of storage at writing time. Therefore, efi_pstore may not work if it tries to write a large amount of data. [Patch Description] To avoid handling the situation above, this patch checks if there is a space enough to log with QueryVariableInfo() before writing data. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Mike Waychison <mikew@google.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07x86 get_unmapped_area: Access mmap_legacy_base through mm_struct memberRadu Caragea
commit 41aacc1eea645c99edbe8fbcf78a97dc9b862adc upstream. This is the updated version of df54d6fa5427 ("x86 get_unmapped_area(): use proper mmap base for bottom-up direction") that only randomizes the mmap base address once. Signed-off-by: Radu Caragea <sinaelgl@gmail.com> Reported-and-tested-by: Jeff Shorey <shoreyjeff@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Adrian Sendroiu <molecula2788@gmail.com> Cc: Greg KH <greg@kroah.com> Cc: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07x86, build, icc: Remove uninitialized_var() from compiler-intel.hH. Peter Anvin
commit 503cf95c061a0551eb684da364509297efbe55d9 upstream. When compiling with icc, <linux/compiler-gcc.h> ends up included because the icc environment defines __GNUC__. Thus, we neither need nor want to have this macro defined in both compiler-gcc.h and compiler-intel.h, and the fact that they are inconsistent just makes the compiler spew warnings. Reported-by: Sunil K. Pandey <sunil.k.pandey@intel.com> Cc: Kevin B. Smith <kevin.b.smith@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/n/tip-0mbwou1zt7pafij09b897lg3@git.kernel.org [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07mac80211: introduce IEEE80211_HW_TEARDOWN_AGGR_ON_BAR_FAILStanislaw Gruszka
commit 5b632fe85ec82e5c43740b52e74c66df50a37db3 upstream. Commit f0425beda4d404a6e751439b562100b902ba9c98 "mac80211: retry sending failed BAR frames later instead of tearing down aggr" caused regression on rt2x00 hardware (connection hangs). This regression was fixed by commit be03d4a45c09ee5100d3aaaedd087f19bc20d01 "rt2x00: Don't let mac80211 send a BAR when an AMPDU subframe fails". But the latter commit caused yet another problem reported in https://bugzilla.kernel.org/show_bug.cgi?id=42828#c22 After long discussion in this thread: http://mid.gmane.org/20121018075615.GA18212@redhat.com and testing various alternative solutions, which failed on one or other setup, we have no other good fix for the issues like just revert both mentioned earlier commits. To do not affect other hardware which benefit from commit f0425beda4d404a6e751439b562100b902ba9c98, instead of reverting it, introduce flag that when used will restore mac80211 behaviour before the commit. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> [replaced link with mid.gmane.org that has message-id] Signed-off-by: Johannes Berg <johannes.berg@intel.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [hq: Backported to 3.4: adjust context] Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07perf: Fix perf ring buffer memory orderingPeter Zijlstra
commit bf378d341e4873ed928dc3c636252e6895a21f50 upstream. The PPC64 people noticed a missing memory barrier and crufty old comments in the perf ring buffer code. So update all the comments and add the missing barrier. When the architecture implements local_t using atomic_long_t there will be double barriers issued; but short of introducing more conditional barrier primitives this is the best we can do. Reported-by: Victor Kaplansky <victork@il.ibm.com> Tested-by: Victor Kaplansky <victork@il.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: michael@ellerman.id.au Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: anton@samba.org Cc: benh@kernel.crashing.org Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07trace: module: Maintain a valid user countRomain Izard
commit 098507ae3ec2331476fb52e85d4040c1cc6d0ef4 upstream. The replacement of the 'count' variable by two variables 'incs' and 'decs' to resolve some race conditions during module unloading was done in parallel with some cleanup in the trace subsystem, and was integrated as a merge. Unfortunately, the formula for this replacement was wrong in the tracing code, and the refcount in the traces was not usable as a result. Use 'count = incs - decs' to compute the user count. Link: http://lkml.kernel.org/p/1393924179-9147-1-git-send-email-romain.izard.pro@gmail.com Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Fixes: c1ab9cab7509 "merge conflict resolution" Signed-off-by: Romain Izard <romain.izard.pro@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07ftrace/module: Hardcode ftrace_module_init() call into load_module()Steven Rostedt (Red Hat)
commit a949ae560a511fe4e3adf48fa44fefded93e5c2b upstream. A race exists between module loading and enabling of function tracer. CPU 1 CPU 2 ----- ----- load_module() module->state = MODULE_STATE_COMING register_ftrace_function() mutex_lock(&ftrace_lock); ftrace_startup() update_ftrace_function(); ftrace_arch_code_modify_prepare() set_all_module_text_rw(); <enables-ftrace> ftrace_arch_code_modify_post_process() set_all_module_text_ro(); [ here all module text is set to RO, including the module that is loading!! ] blocking_notifier_call_chain(MODULE_STATE_COMING); ftrace_init_module() [ tries to modify code, but it's RO, and fails! ftrace_bug() is called] When this race happens, ftrace_bug() will produces a nasty warning and all of the function tracing features will be disabled until reboot. The simple solution is to treate module load the same way the core kernel is treated at boot. To hardcode the ftrace function modification of converting calls to mcount into nops. This is done in init/main.c there's no reason it could not be done in load_module(). This gives a better control of the changes and doesn't tie the state of the module to its notifiers as much. Ftrace is special, it needs to be treated as such. The reason this would work, is that the ftrace_module_init() would be called while the module is in MODULE_STATE_UNFORMED, which is ignored by the set_all_module_text_ro() call. Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07skb: Add inline helper for getting the skb end offset from headAlexander Duyck
[ Upstream commit ec47ea82477404631d49b8e568c71826c9b663ac ] With the recent changes for how we compute the skb truesize it occurs to me we are probably going to have a lot of calls to skb_end_pointer - skb->head. Instead of running all over the place doing that it would make more sense to just make it a separate inline skb_end_offset(skb) that way we can return the correct value without having gcc having to do all the optimization to cancel out skb->head - skb->head. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07ipv6: Limit mtu to 65575 bytesEric Dumazet
[ Upstream commit 30f78d8ebf7f514801e71b88a10c948275168518 ] Francois reported that setting big mtu on loopback device could prevent tcp sessions making progress. We do not support (yet ?) IPv6 Jumbograms and cook corrupted packets. We must limit the IPv6 MTU to (65535 + 40) bytes in theory. Tested: ifconfig lo mtu 70000 netperf -H ::1 Before patch : Throughput : 0.05 Mbits After patch : Throughput : 35484 Mbits Reported-by: Francois WELLENREITER <f.wellenreiter@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07list: introduce list_next_entry() and list_prev_entry()Oleg Nesterov
[ Upstream commit 008208c6b26f21c2648c250a09c55e737c02c5f8 ] Add two trivial helpers list_next_entry() and list_prev_entry(), they can have a lot of users including list.h itself. In fact the 1st one is already defined in events/core.c and bnx2x_sp.c, so the patch simply moves the definition to list.h. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eilon Greenstein <eilong@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-18net: Add net_ratelimited_function and net_<level>_ratelimited macrosJoe Perches
commit 3a3bfb61e64476ff1e4ac3122cb6dec9c79b795c upstream. __ratelimit() can be considered an inverted bool test because it returns true when not ratelimited. Several tests in the kernel tree use this __ratelimit() function incorrectly. No net_ratelimit uses are incorrect currently though. Most uses of net_ratelimit are to log something via printk or pr_<level>. In order to minimize the uses of net_ratelimit, and to start standardizing the code style used for __ratelimit() and net_ratelimit(), add a net_ratelimited_function() macro and net_<level>_ratelimited() logging macros similar to pr_<level>_ratelimited that use the global net_ratelimit instead of a static per call site "struct ratelimit_state". Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-18netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->lenAndrey Vagin
commit 223b02d923ecd7c84cf9780bb3686f455d279279 upstream. "len" contains sizeof(nf_ct_ext) and size of extensions. In a worst case it can contain all extensions. Bellow you can find sizes for all types of extensions. Their sum is definitely bigger than 256. nf_ct_ext_types[0]->len = 24 nf_ct_ext_types[1]->len = 32 nf_ct_ext_types[2]->len = 24 nf_ct_ext_types[3]->len = 32 nf_ct_ext_types[4]->len = 152 nf_ct_ext_types[5]->len = 2 nf_ct_ext_types[6]->len = 16 nf_ct_ext_types[7]->len = 8 I have seen "len" up to 280 and my host has crashes w/o this patch. The right way to fix this problem is reducing the size of the ecache extension (4) and Florian is going to do this, but these changes will be quite large to be appropriate for a stable tree. Fixes: 5b423f6a40a0 (netfilter: nf_conntrack: fix racy timer handling with reliable) Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-18blktrace: fix accounting of partially completed requestsRoman Pen
commit af5040da01ef980670b3741b3e10733ee3e33566 upstream. trace_block_rq_complete does not take into account that request can be partially completed, so we can get the following incorrect output of blkparser: C R 232 + 240 [0] C R 240 + 232 [0] C R 248 + 224 [0] C R 256 + 216 [0] but should be: C R 232 + 8 [0] C R 240 + 8 [0] C R 248 + 8 [0] C R 256 + 8 [0] Also, the whole output summary statistics of completed requests and final throughput will be incorrect. This patch takes into account real completion size of the request and fixes wrong completion accounting. Signed-off-by: Roman Pen <r.peniaev@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@redhat.com> CC: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13libata/ahci: accommodate tag ordered controllersDan Williams
commit 8a4aeec8d2d6a3edeffbdfae451cdf05cbf0fefd upstream. The AHCI spec allows implementations to issue commands in tag order rather than FIFO order: 5.3.2.12 P:SelectCmd HBA sets pSlotLoc = (pSlotLoc + 1) mod (CAP.NCS + 1) or HBA selects the command to issue that has had the PxCI bit set to '1' longer than any other command pending to be issued. The result is that commands posted sequentially (time-wise) may play out of sequence when issued by hardware. This behavior has likely been hidden by drives that arrange for commands to complete in issue order. However, it appears recent drives (two from different vendors that we have found so far) inflict out-of-order completions as a matter of course. So, we need to take care to maintain ordered submission, otherwise we risk triggering a drive to fall out of sequential-io automation and back to random-io processing, which incurs large latency and degrades throughput. This issue was found in simple benchmarks where QD=2 seq-write performance was 30-50% *greater* than QD=32 seq-write performance. Tagging for -stable and making the change globally since it has a low risk-to-reward ratio. Also, word is that recent versions of an unnamed OS also does it this way now. So, drives in the field are already experienced with this tag ordering scheme. Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ed Ciechanowski <ed.ciechanowski@intel.com> Reviewed-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14USB: fix build error when CONFIG_PM_SLEEP isn't enabledAlan Stern
commit 9d8924297cd9c256c23c02abae40202563452453 upstream. This patch fixes a build error that occurs when CONFIG_PM is enabled and CONFIG_PM_SLEEP isn't: >> drivers/usb/host/ohci-pci.c:294:10: error: 'usb_hcd_pci_pm_ops' undeclared here (not in a function) .pm = &usb_hcd_pci_pm_ops Since the usb_hcd_pci_pm_ops structure is defined and used when CONFIG_PM is enabled, its declaration should not be protected by CONFIG_PM_SLEEP. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14USB: serial: add modem-status-change wait queueJohan Hovold
commit e5b33dc9d16053c2ae4c2c669cf008829530364b upstream. Add modem-status-change wait queue to struct usb_serial_port that subdrivers can use to implement TIOCMIWAIT. Currently subdrivers use a private wait queue which may have been released when waking up after device disconnected. Note that we're adding a new wait queue rather than reusing the tty-port one as we do not want to get woken up at hangup (yet). Signed-off-by: Johan Hovold <jhovold@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14ARM: Orion: Set eth packet size csum offload limitArnaud Patard (Rtp)
commit 58569aee5a1a5dcc25c34a0a2ed9a377874e6b05 upstream. The mv643xx ethernet controller limits the packet size for the TX checksum offloading. This patch sets this limits for Kirkwood and Dove which have smaller limits that the default. As a side note, this patch is an updated version of a patch sent some years ago: http://lists.infradead.org/pipermail/linux-arm-kernel/2010-June/017320.html which seems to have been lost. Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org> Signed-off-by: Jason Cooper <jason@lakedaemon.net> [bwh: Backported to 3.2: adjust for the extra two parameters of orion_ge0{0,1}_init()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [yangyl: Backported to 3.4: Adjust context] Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14HID: fix return value of hidraw_report_event() when !CONFIG_HIDRAWJiri Kosina
commit d6d7c873529abd622897cad5e36f1fd7d82f5110 upstream. Commit b6787242f327 ("HID: hidraw: add proper error handling to raw event reporting") forgot to update the static inline version of hidraw_report_event() for the case when CONFIG_HIDRAW is unset. Fix that up. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14HID: hidraw: add proper error handling to raw event reportingJiri Kosina
commit b6787242f32700377d3da3b8d788ab3928bab849 upstream. If kmemdup() in hidraw_report_event() fails, we are not propagating this fact properly. Let hidraw_report_event() and hid_report_raw_event() return an error value to the caller. Reported-by: Oliver Neukum <oneukum@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14pps: Add pps_lookup_dev() functionGeorge Spelvin
commit 513b032c98b4b9414aa4e9b4a315cb1bf0380101 upstream. The PPS serial line discipline wants to attach a PPS device to a tty without changing the tty code to add a struct pps_device * pointer. Since the number of PPS devices in a typical system is generally very low (n=1 is by far the most common), it's practical to search the entire list of allocated pps devices. (We capture the timestamp before the lookup, so the timing isn't affected.) It is a bit ugly that this function, which is part of the in-kernel PPS API, has to be in pps.c as opposed to kapi,c, but that's not something that affects users. Signed-off-by: George Spelvin <linux@horizon.com> Acked-by: Rodolfo Giometti <giometti@enneenne.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Qiang Huang <h.huangqiang@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14idr: idr_for_each_entry() macroPhilipp Reisner
commit 9749f30f1a387070e6e8351f35aeb829eacc3ab6 upstream. Inspired by the list_for_each_entry() macro Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Qiang Huang <h.huangqiang@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14ipc, msg: fix message length check for negative valuesMathias Krause
commit 4e9b45a19241354daec281d7a785739829b52359 upstream. On 64 bit systems the test for negative message sizes is bogus as the size, which may be positive when evaluated as a long, will get truncated to an int when passed to load_msg(). So a long might very well contain a positive value but when truncated to an int it would become negative. That in combination with a small negative value of msg_ctlmax (which will be promoted to an unsigned type for the comparison against msgsz, making it a big positive value and therefore make it pass the check) will lead to two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too small buffer as the addition of alen is effectively a subtraction. 2/ The copy_from_user() call in load_msg() will first overflow the buffer with userland data and then, when the userland access generates an access violation, the fixup handler copy_user_handle_tail() will try to fill the remainder with zeros -- roughly 4GB. That almost instantly results in a system crash or reset. ,-[ Reproducer (needs to be run as root) ]-- | #include <sys/stat.h> | #include <sys/msg.h> | #include <unistd.h> | #include <fcntl.h> | | int main(void) { | long msg = 1; | int fd; | | fd = open("/proc/sys/kernel/msgmax", O_WRONLY); | write(fd, "-1", 2); | close(fd); | | msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT); | | return 0; | } '--- Fix the issue by preventing msgsz from getting truncated by consistently using size_t for the message length. This way the size checks in do_msgsnd() could still be passed with a negative value for msg_ctlmax but we would fail on the buffer allocation in that case and error out. Also change the type of m_ts from int to size_t to avoid similar nastiness in other code paths -- it is used in similar constructs, i.e. signed vs. unsigned checks. It should never become negative under normal circumstances, though. Setting msg_ctlmax to a negative value is an odd configuration and should be prevented. As that might break existing userland, it will be handled in a separate commit so it could easily be reverted and reworked without reintroducing the above described bug. Hardening mechanisms for user copy operations would have catched that bug early -- e.g. checking slab object sizes on user copy operations as the usercopy feature of the PaX patch does. Or, for that matter, detect the long vs. int sign change due to truncation, as the size overflow plugin of the very same patch does. [akpm@linux-foundation.org: fix i386 min() warnings] Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Pax Team <pageexec@freemail.hu> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Brad Spengler <spender@grsecurity.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Adjust context - Drop changes to alloc_msg() and copy_msg(), which don't exist] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Qiang Huang <h.huangqiang@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14compiler/gcc4: Add quirk for 'asm goto' miscompilation bugIngo Molnar
commit 3f0116c3238a96bc18ad4b4acefe4e7be32fa861 upstream. Fengguang Wu, Oleg Nesterov and Peter Zijlstra tracked down a kernel crash to a GCC bug: GCC miscompiles certain 'asm goto' constructs, as outlined here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 Implement a workaround suggested by Jakub Jelinek. Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Suggested-by: Jakub Jelinek <jakub@redhat.com> Reviewed-by: Richard Henderson <rth@twiddle.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [hq: Backported to 3.4: Adjust context] Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-14compiler-gcc.h: Add gcc-recommended GCC_VERSION macroDaniel Santos
commit 3f3f8d2f48acfd8ed3b8e6b7377935da57b27b16 upstream. Throughout compiler*.h, many version checks are made. These can be simplified by using the macro that gcc's documentation recommends. However, my primary reason for adding this is that I need bug-check macros that are enabled at certain gcc versions and it's cleaner to use this macro than the tradition method: #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ => 2) If you add patch level, it gets this ugly: #if __GNUC__ > 4 || (__GNUC__ == 4 && (__GNUC_MINOR__ > 2 || \ __GNUC_MINOR__ == 2 __GNUC_PATCHLEVEL__ >= 1)) As opposed to: #if GCC_VERSION >= 40201 While having separate headers for gcc 3 & 4 eliminates some of this verbosity, they can still be cleaned up by this. See also: http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html Signed-off-by: Daniel Santos <daniel.santos@pobox.com> Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Joe Perches <joe@perches.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Qiang Huang <h.huangqiang@huawei.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-03ext4: atomically set inode->i_flags in ext4_set_inode_flags()Theodore Ts'o
commit 00a1a053ebe5febcfc2ec498bd894f035ad2aa06 upstream. Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time. Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23jiffies: Avoid undefined behavior from signed overflowPaul E. McKenney
commit 5a581b367b5df0531265311fc681c2abd377e5e6 upstream. According to the C standard 3.4.3p3, overflow of a signed integer results in undefined behavior. This commit therefore changes the definitions of time_after(), time_after_eq(), time_after64(), and time_after_eq64() to avoid this undefined behavior. The trick is that the subtraction is done using unsigned arithmetic, which according to 6.2.5p9 cannot overflow because it is defined as modulo arithmetic. This has the added (though admittedly quite small) benefit of shortening four lines of code by four characters each. Note that the C standard considers the cast from unsigned to signed to be implementation-defined, see 6.3.1.3p3. However, on a two's-complement system, an implementation that defines anything other than a reinterpretation of the bits is free to come to me, and I will be happy to act as a witness for its being committed to an insane asylum. (Although I have nothing against saturating arithmetic or signals in some cases, these things really should not be the default when compiling an operating-system kernel.) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Kevin Easton <kevin@guarana.org> [ paulmck: Included time_after64() and time_after_eq64(), as suggested by Eric Dumazet, also fixed commit message.] Reviewed-by: Josh Triplett <josh@joshtriplett.org> Ruchi Kandoi <kandoiruchi@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23firewire: don't use PREPARE_DELAYED_WORKTejun Heo
commit 70044d71d31d6973665ced5be04ef39ac1c09a48 upstream. PREPARE_[DELAYED_]WORK() are being phased out. They have few users and a nasty surprise in terms of reentrancy guarantee as workqueue considers work items to be different if they don't have the same work function. firewire core-device and sbp2 have been been multiplexing work items with multiple work functions. Introduce fw_device_workfn() and sbp2_lu_workfn() which invoke fw_device->workfn and sbp2_logical_unit->workfn respectively and always use the two functions as the work functions and update the users to set the ->workfn fields instead of overriding work functions using PREPARE_DELAYED_WORK(). This fixes a variety of possible regressions since a2c1c57be8d9 "workqueue: consider work function when searching for busy work items" due to which fw_workqueue lost its required non-reentrancy property. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: linux1394-devel@lists.sourceforge.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23tracing: Do not add event files for modules that fail tracepointsSteven Rostedt (Red Hat)
commit 45ab2813d40d88fc575e753c38478de242d03f88 upstream. If a module fails to add its tracepoints due to module tainting, do not create the module event infrastructure in the debugfs directory. As the events will not work and worse yet, they will silently fail, making the user wonder why the events they enable do not display anything. Having a warning on module load and the events not visible to the users will make the cause of the problem much clearer. Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org Fixes: 6d723736e472 "tracing/events: add support for modules to TRACE_EVENT" Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-11xen/io/ring.h: new macro to detect whether there are too many requests on ↵Jan Beulich
the ring commit 8d9256906a97c24e97e016482b9be06ea2532b05 upstream. Backends may need to protect themselves against an insane number of produced requests stored by a frontend, in case they iterate over requests until reaching the req_prod value. There can't be more requests on the ring than the difference between produced requests and produced (but possibly not yet published) responses. This is a more strict alternative to a patch previously posted by Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-11xen-netback: coalesce slots in TX path and fix regressionsWei Liu
commit 2810e5b9a7731ca5fce22bfbe12c96e16ac44b6f upstream. This patch tries to coalesce tx requests when constructing grant copy structures. It enables netback to deal with situation when frontend's MAX_SKB_FRAGS is larger than backend's MAX_SKB_FRAGS. With the help of coalescing, this patch tries to address two regressions avoid reopening the security hole in XSA-39. Regression 1. The reduction of the number of supported ring entries (slots) per packet (from 18 to 17). This regression has been around for some time but remains unnoticed until XSA-39 security fix. This is fixed by coalescing slots. Regression 2. The XSA-39 security fix turning "too many frags" errors from just dropping the packet to a fatal error and disabling the VIF. This is fixed by coalescing slots (handling 18 slots when backend's MAX_SKB_FRAGS is 17) which rules out false positive (using 18 slots is legit) and dropping packets using 19 to `max_skb_slots` slots. To avoid reopening security hole in XSA-39, frontend sending packet using more than max_skb_slots is considered malicious. The behavior of netback for packet is thus: 1-18 slots: valid 19-max_skb_slots slots: drop and respond with an error max_skb_slots+ slots: fatal error max_skb_slots is configurable by admin, default value is 20. Also change variable name from "frags" to "slots" in netbk_count_requests. Please note that RX path still has dependency on MAX_SKB_FRAGS. This will be fixed with separate patch. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-11nbd: correct disconnect behaviorPaul Clements
commit c378f70adbc1bbecd9e6db145019f14b2f688c7c upstream. Currently, when a disconnect is requested by the user (via NBD_DISCONNECT ioctl) the return from NBD_DO_IT is undefined (it is usually one of several error codes). This means that nbd-client does not know if a manual disconnect was performed or whether a network error occurred. Because of this, nbd-client's persist mode (which tries to reconnect after error, but not after manual disconnect) does not always work correctly. This change fixes this by causing NBD_DO_IT to always return 0 if a user requests a disconnect. This means that nbd-client can correctly either persist the connection (if an error occurred) or disconnect (if the user requested it). Signed-off-by: Paul Clements <paul.clements@steeleye.com> Acked-by: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [xr: Backported to 3.4: adjust context] Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-11ext4/jbd2: don't wait (forever) for stale tid caused by wraparoundTheodore Ts'o
commit d76a3a77113db020d9bb1e894822869410450bd9 upstream. In the case where an inode has a very stale transaction id (tid) in i_datasync_tid or i_sync_tid, it's possible that after a very large (2**31) number of transactions, that the tid number space might wrap, causing tid_geq()'s calculations to fail. Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily", attempted to fix this problem, but it only avoided kjournald spinning forever by fixing the logic in jbd2_log_start_commit(). Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c that might call jbd2_log_start_commit() with a stale tid, those functions will subsequently call jbd2_log_wait_commit() with the same stale tid, and then wait for a very long time. To fix this, we replace the calls to jbd2_log_start_commit() and jbd2_log_wait_commit() with a call to a new function, jbd2_complete_transaction(), which will correctly handle stale tid's. As a bonus, jbd2_complete_transaction() will avoid locking j_state_lock for writing unless a commit needs to be started. This should have a small (but probably not measurable) improvement for ext4's scalability. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: George Barnett <gbarnett@atlassian.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-11cgroup: fix RCU accesses to task->cgroupsTejun Heo
commit 14611e51a57df10240817d8ada510842faf0ec51 upstream. task->cgroups is a RCU pointer pointing to struct css_set. A task switches to a different css_set on cgroup migration but a css_set doesn't change once created and its pointers to cgroup_subsys_states aren't RCU protected. task_subsys_state[_check]() is the macro to acquire css given a task and subsys_id pair. It RCU-dereferences task->cgroups->subsys[] not task->cgroups, so the RCU pointer task->cgroups ends up being dereferenced without read_barrier_depends() after it. It's broken. Fix it by introducing task_css_set[_check]() which does RCU-dereference on task->cgroups. task_subsys_state[_check]() is reimplemented to directly dereference ->subsys[] of the css_set returned from task_css_set[_check](). This removes some of sparse RCU warnings in cgroup. v2: Fixed unbalanced parenthsis and there's no need to use rcu_dereference_raw() when !CONFIG_PROVE_RCU. Both spotted by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Li Zefan <lizefan@huawei.com> [bwh: Backported to 3.2: - Adjust context - Remove CONFIG_PROVE_RCU condition - s/lockdep_is_held(&cgroup_mutex)/cgroup_lock_is_held()/] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Qiang Huang <h.huangqiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-11cgroup: cgroup_subsys->fork() should be called after the task is added to ↵Tejun Heo
css_set commit 5edee61edeaaebafe584f8fb7074c1ef4658596b upstream. cgroup core has a bug which violates a basic rule about event notifications - when a new entity needs to be added, you add that to the notification list first and then make the new entity conform to the current state. If done in the reverse order, an event happening inbetween will be lost. cgroup_subsys->fork() is invoked way before the new task is added to the css_set. Currently, cgroup_freezer is the only user of ->fork() and uses it to make new tasks conform to the current state of the freezer. If FROZEN state is requested while fork is in progress between cgroup_fork_callbacks() and cgroup_post_fork(), the child could escape freezing - the cgroup isn't frozen when ->fork() is called and the freezer couldn't see the new task on the css_set. This patch moves cgroup_subsys->fork() invocation to cgroup_post_fork() after the new task is added to the css_set. cgroup_fork_callbacks() is removed. Because now a task may be migrated during cgroup_subsys->fork(), freezer_fork() is updated so that it adheres to the usual RCU locking and the rather pointless comment on why locking can be different there is removed (if it doesn't make anything simpler, why even bother?). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> [hq: Backported to 3.4: - Adjust context - Iterate over first CGROUP_BUILTIN_SUBSYS_COUNT elements of subsys] Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-11net: ip, ipv6: handle gso skbs in forwarding pathFlorian Westphal
commit fe6cc55f3a9a053482a76f5a6b2257cee51b4663 upstream. [ use zero netdev_feature mask to avoid backport of netif_skb_dev_features function ] Marcelo Ricardo Leitner reported problems when the forwarding link path has a lower mtu than the incoming one if the inbound interface supports GRO. Given: Host <mtu1500> R1 <mtu1200> R2 Host sends tcp stream which is routed via R1 and R2. R1 performs GRO. In this case, the kernel will fail to send ICMP fragmentation needed messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu checks in forward path. Instead, Linux tries to send out packets exceeding the mtu. When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does not fragment the packets when forwarding, and again tries to send out packets exceeding R1-R2 link mtu. This alters the forwarding dstmtu checks to take the individual gso segment lengths into account. For ipv6, we send out pkt too big error for gso if the individual segments are too big. For ipv4, we either send icmp fragmentation needed, or, if the DF bit is not set, perform software segmentation and let the output path create fragments when the packet is leaving the machine. It is not 100% correct as the error message will contain the headers of the GRO skb instead of the original/segmented one, but it seems to work fine in my (limited) tests. Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid sofware segmentation. However it turns out that skb_segment() assumes skb nr_frags is related to mss size so we would BUG there. I don't want to mess with it considering Herbert and Eric disagree on what the correct behavior should be. Hannes Frederic Sowa notes that when we would shrink gso_size skb_segment would then also need to deal with the case where SKB_MAX_FRAGS would be exceeded. This uses sofware segmentation in the forward path when we hit ipv4 non-DF packets and the outgoing link mtu is too small. Its not perfect, but given the lack of bug reports wrt. GRO fwd being broken this is a rare case anyway. Also its not like this could not be improved later once the dust settles. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-11net: add and use skb_gso_transport_seglen()Florian Westphal
commit de960aa9ab4decc3304959f69533eef64d05d8e8 upstream. [ no skb_gso_seglen helper in 3.4, leave tbf alone ] This moves part of Eric Dumazets skb_gso_seglen helper from tbf sched to skbuff core so it may be reused by upcoming ip forwarding path patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20sched/nohz: Fix rq->cpu_load calculations some morePeter Zijlstra
commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Cc: Venkatesh Pallipadi <venki@google.com> Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13sched/rt: Avoid updating RT entry timeout twice within one tick periodYing Xue
commit 57d2aa00dcec67afa52478730f2b524521af14fb upstream. The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, it means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Fan Du <fan.du@windriver.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <peterz@infradead.org> Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>