aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2014-02-06x86, cpu, amd: Add workaround for family 16h, erratum 793Borislav Petkov
commit 3b56496865f9f7d9bcb2f93b44c63f274f08e3b6 upstream. This adds the workaround for erratum 793 as a precaution in case not every BIOS implements it. This addresses CVE-2013-6885. Erratum text: [Revision Guide for AMD Family 16h Models 00h-0Fh Processors, document 51810 Rev. 3.04 November 2013] 793 Specific Combination of Writes to Write Combined Memory Types and Locked Instructions May Cause Core Hang Description Under a highly specific and detailed set of internal timing conditions, a locked instruction may trigger a timing sequence whereby the write to a write combined memory type is not flushed, causing the locked instruction to stall indefinitely. Potential Effect on System Processor core hang. Suggested Workaround BIOS should set MSR C001_1020[15] = 1b. Fix Planned No fix planned [ hpa: updated description, fixed typo in MSR name ] Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20140114230711.GS29865@pd.tnic Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06bpf: do not use reciprocal divideEric Dumazet
[ Upstream commit aee636c4809fa54848ff07a899b326eb1f9987a2 ] At first Jakub Zawadzki noticed that some divisions by reciprocal_divide were not correct. (off by one in some cases) http://www.wireshark.org/~darkjames/reciprocal-buggy.c He could also show this with BPF: http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c The reciprocal divide in linux kernel is not generic enough, lets remove its use in BPF, as it is not worth the pain with current cpus. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl> Cc: Mircea Gherzan <mgherzan@gmail.com> Cc: Daniel Borkmann <dxchgb@gmail.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Matt Evans <matt@ozlabs.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06KVM: x86: limit PIT timer frequencyMarcelo Tosatti
commit 9ed96e87c5748de4c2807ef17e81287c7304186c upstream. Limit PIT timer frequency similarly to the limit applied by LAPIC timer. Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06x86/efi: Fix off-by-one bug in EFI Boot Services reservationDave Young
commit a7f84f03f660d93574ac88835d056c0d6468aebe upstream. Current code check boot service region with kernel text region by: start+size >= __pa_symbol(_text) The end of the above region should be start + size - 1 instead. I see this problem in ovmf + Fedora 19 grub boot: text start: 1000000 md start: 800000 md size: 800000 Signed-off-by: Dave Young <dyoung@redhat.com> Acked-by: Borislav Petkov <bp@suse.de> Acked-by: Toshi Kani <toshi.kani@hp.com> Tested-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06kvm: x86: fix apic_base enable checkAndrew Jones
commit 0dce7cd67fd9055c4a2ff278f8af1431e646d346 upstream. Commit e66d2ae7c67bd moved the assignment vcpu->arch.apic_base = value above a condition with (vcpu->arch.apic_base ^ value), causing that check to always fail. Use old_value, vcpu->arch.apic_base's old value, in the condition instead. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25ftrace/x86: Load ftrace_ops in parameter not the variable holding itSteven Rostedt
commit 1739f09e33d8f66bf48ddbc3eca615574da6c4f6 upstream. Function tracing callbacks expect to have the ftrace_ops that registered it passed to them, not the address of the variable that holds the ftrace_ops that registered it. Use a mov instead of a lea to store the ftrace_ops into the parameter of the function tracing callback. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20131113152004.459787f9@gandalf.local.home Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10hRobert Richter
commit bee09ed91cacdbffdbcd3b05de8409c77ec9fcd6 upstream. On AMD family 10h we see following error messages while waking up from S3 for all non-boot CPUs leading to a failed IBS initialization: Enabling non-boot CPUs ... smpboot: Booting Node 0 Processor 1 APIC 0x1 [Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu perf: IBS APIC setup failed on cpu #1 process: Switch to broadcast mode on CPU1 CPU1 is up ... ACPI: Waking up from system sleep state S3 Reason for this is that during suspend the LVT offset for the IBS vector gets lost and needs to be reinialized while resuming. The offset is read from the IBSCTL msr. On family 10h the offset needs to be 1 as offset 0 is used for the MCE threshold interrupt, but firmware assings it for IBS to 0 too. The kernel needs to reprogram the vector. The msr is a readonly node msr, but a new value can be written via pci config space access. The reinitialization is implemented for family 10h in setup_ibs_ctl() which is forced during IBS setup. This patch fixes IBS setup after waking up from S3 by adding resume/supend hooks for the boot cpu which does the offset reinitialization. Marking it as stable to let distros pick up this fix. Signed-off-by: Robert Richter <rric@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15x86, fpu, amd: Clear exceptions in AMD FXSAVE workaroundLinus Torvalds
commit 26bef1318adc1b3a530ecc807ef99346db2aa8b0 upstream. Before we do an EMMS in the AMD FXSAVE information leak workaround we need to clear any pending exceptions, otherwise we trap with a floating-point exception inside this code. Reported-by: halfdog <me@halfdog.net> Tested-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/CA%2B55aFxQnY_PCG_n4=0w-VG=YLXL-yr7oMxyy0WU2gCBAf3ydg@mail.gmail.com Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: fix TLB flush race between migration, and change_protection_rangeRik van Riel
commit 20841405940e7be0617612d521e206e4b6b325db upstream. There are a few subtle races, between change_protection_range (used by mprotect and change_prot_numa) on one side, and NUMA page migration and compaction on the other side. The basic race is that there is a time window between when the PTE gets made non-present (PROT_NONE or NUMA), and the TLB is flushed. During that time, a CPU may continue writing to the page. This is fine most of the time, however compaction or the NUMA migration code may come in, and migrate the page away. When that happens, the CPU may continue writing, through the cached translation, to what is no longer the current memory location of the process. This only affects x86, which has a somewhat optimistic pte_accessible. All other architectures appear to be safe, and will either always flush, or flush whenever there is a valid mapping, even with no permissions (SPARC). The basic race looks like this: CPU A CPU B CPU C load TLB entry make entry PTE/PMD_NUMA fault on entry read/write old page start migrating page change PTE/PMD to new page read/write old page [*] flush TLB reload TLB from new entry read/write new page lose data [*] the old page may belong to a new user at this point! The obvious fix is to flush remote TLB entries, by making sure that pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may still be accessible if there is a TLB flush pending for the mm. This should fix both NUMA migration and compaction. [mgorman@suse.de: fix build] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09sched: fix the theoretical signal_wake_up() vs schedule() raceOleg Nesterov
commit e0acd0a68ec7dbf6b7a81a87a867ebd7ac9b76c4 upstream. This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if(signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09KVM: x86: Fix APIC map calculation after re-enablingJan Kiszka
commit e66d2ae7c67bd9ac982a3d1890564de7f7eabf4b upstream. Update arch.apic_base before triggering recalculate_apic_map. Otherwise the recalculation will work against the previous state of the APIC and will fail to build the correct map when an APIC is hardware-enabled again. This fixes a regression of 1e08ec4a13. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09x86 idle: Repair large-server 50-watt idle-power regressionLen Brown
commit 40e2d7f9b5dae048789c64672bf3027fbb663ffa upstream. Linux 3.10 changed the timing of how thread_info->flags is touched: x86: Use generic idle loop (7d1a941731fabf27e5fb6edbebb79fe856edb4e5) This caused Intel NHM-EX and WSM-EX servers to experience a large number of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote from deep C-states to shallow C-states, which caused these platforms to experience a significant increase in idle power. Note that this issue was already present before the commit above, however, it wasn't seen often enough to be noticed in power measurements. Here we extend an errata workaround from the Core2 EX "Dunnington" to extend to NHM-EX and WSM-EX, to prevent these immediate returns from MWAIT, reducing idle power on these platforms. While only acpi_idle ran on Dunnington, intel_idle may also run on these two newer systems. As of today, there are no other models that are known to need this tweak. Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com Signed-off-by: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20x86, build: Pass in additional -mno-mmx, -mno-sse optionsH. Peter Anvin
commit 8b3b005d675726e38bc504d2e35a991e55819155 upstream. In checkin 5551a34e5aea x86-64, build: Always pass in -mno-sse we unconditionally added -mno-sse to the main build, to keep newer compilers from generating SSE instructions from autovectorization. However, this did not extend to the special environments (arch/x86/boot, arch/x86/boot/compressed, and arch/x86/realmode/rm). Add -mno-sse to the compiler command line for these environments, and add -mno-mmx to all the environments as well, as we don't want a compiler to generate MMX code either. This patch also removes a $(cc-option) call for -m32, since we have long since stopped supporting compilers too old for the -m32 option, and in fact hardcode it in other places in the Makefiles. Reported-by: Kevin B. Smith <kevin.b.smith@intel.com> Cc: Sunil K. Pandey <sunil.k.pandey@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: H. J. Lu <hjl.tools@gmail.com> Link: http://lkml.kernel.org/n/tip-j21wzqv790q834n7yc6g80j1@git.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20x86, efi: Don't use (U)EFI time services on 32 bitMatthew Garrett
commit 04bf9ba720fcc4fa313fa122b799ae0989b6cd50 upstream. UEFI time services are often broken once we're in virtual mode. We were already refusing to use them on 64-bit systems, but it turns out that they're also broken on some 32-bit firmware, including the Dell Venue. Disable them for now, we can revisit once we have the 1:1 mappings code incorporated. Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com> Link: http://lkml.kernel.org/r/1385754283-2464-1-git-send-email-matthew.garrett@nebula.com Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)Gleb Natapov
commit 17d68b763f09a9ce824ae23eb62c9efc57b69271 upstream. A guest can cause a BUG_ON() leading to a host kernel crash. When the guest writes to the ICR to request an IPI, while in x2apic mode the following things happen, the destination is read from ICR2, which is a register that the guest can control. kvm_irq_delivery_to_apic_fast uses the high 16 bits of ICR2 as the cluster id. A BUG_ON is triggered, which is a protection against accessing map->logical_map with an out-of-bounds access and manages to avoid that anything really unsafe occurs. The logic in the code is correct from real HW point of view. The problem is that KVM supports only one cluster with ID 0 in clustered mode, but the code that has the bug does not take this into account. Reported-by: Lars Bull <larsbull@google.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20KVM: x86: Convert vapic synchronization to _cached functions (CVE-2013-6368)Andy Honig
commit fda4e2e85589191b123d31cdc21fd33ee70f50fd upstream. In kvm_lapic_sync_from_vapic and kvm_lapic_sync_to_vapic there is the potential to corrupt kernel memory if userspace provides an address that is at the end of a page. This patches concerts those functions to use kvm_write_guest_cached and kvm_read_guest_cached. It also checks the vapic_address specified by userspace during ioctl processing and returns an error to userspace if the address is not a valid GPA. This is generally not guest triggerable, because the required write is done by firmware that runs before the guest. Also, it only affects AMD processors and oldish Intel that do not have the FlexPriority feature (unless you disable FlexPriority, of course; then newer processors are also affected). Fixes: b93463aa59d6 ('KVM: Accelerated apic support') Reported-by: Andrew Honig <ahonig@google.com> Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20KVM: x86: Fix potential divide by 0 in lapic (CVE-2013-6367)Andy Honig
commit b963a22e6d1a266a67e9eecc88134713fd54775c upstream. Under guest controllable circumstances apic_get_tmcct will execute a divide by zero and cause a crash. If the guest cpuid support tsc deadline timers and performs the following sequence of requests the host will crash. - Set the mode to periodic - Set the TMICT to 0 - Set the mode bits to 11 (neither periodic, nor one shot, nor tsc deadline) - Set the TMICT to non-zero. Then the lapic_timer.period will be 0, but the TMICT will not be. If the guest then reads from the TMCCT then the host will perform a divide by 0. This patch ensures that if the lapic_timer.period is 0, then the division does not occur. Reported-by: Andrew Honig <ahonig@google.com> Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11x86-64, build: Always pass in -mno-sseH. Peter Anvin
commit 5551a34e5aeab868f8d37f70d8754868921b4ee5 upstream. Always pass in the -mno-sse argument, regardless if -preferred-stack-boundary is supported. We never want to generate SSE instructions in the kernel unless we *really* know what we're doing. According to H. J. Lu, any version of gcc new enough that we support it at all should handle the -mno-sse option, so just add it unconditionally. Reported-by: Kevin B. Smith <kevin.b.smith@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: H. J. Lu <hjl.tools@gmail.com> Link: http://lkml.kernel.org/n/tip-j21wzqv790q834n7yc6g80j1@git.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29ftrace/x86: skip over the breakpoint for ftrace callerKevin Hao
commit ab4ead02ec235d706d0611d8741964628291237e upstream. In commit 8a4d0a687a59 "ftrace: Use breakpoint method to update ftrace caller", we choose to use breakpoint method to update the ftrace caller. But we also need to skip over the breakpoint in function ftrace_int3_handler() for them. Otherwise weird things would happen. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29KVM: x86: fix emulation of "movzbl %bpl, %eax"Paolo Bonzini
commit daf727225b8abfdfe424716abac3d15a3ac5626a upstream. When I was looking at RHEL5.9's failure to start with unrestricted_guest=0/emulate_invalid_guest_state=1, I got it working with a slightly older tree than kvm.git. I now debugged the remaining failure, which was introduced by commit 660696d1 (KVM: X86 emulator: fix source operand decoding for 8bit mov[zs]x instructions, 2013-04-24) introduced a similar mis-emulation to the one in commit 8acb4207 (KVM: fix sil/dil/bpl/spl in the mod/rm fields, 2013-05-30). The incorrect decoding occurs in 8-bit movzx/movsx instructions whose 8-bit operand is sil/dil/bpl/spl. Needless to say, "movzbl %bpl, %eax" does occur in RHEL5.9's decompression prolog, just a handful of instructions before finally giving control to the decompressed vmlinux and getting out of the invalid guest state. Because OpMem8 bypasses decode_modrm, the same handling of the REX prefix must be applied to OpMem8. Reported-by: Michele Baldessari <michele@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29x86/microcode/amd: Tone down printk(), don't treat a missing firmware file ↵Thomas Renninger
as an error commit 11f918d3e2d3861b6931e97b3aa778e4984935aa upstream. Do it the same way as done in microcode_intel.c: use pr_debug() for missing firmware files. There seem to be CPUs out there for which no microcode update has been submitted to kernel-firmware repo yet resulting in scary sounding error messages in dmesg: microcode: failed to load file amd-ucode/microcode_amd_fam16h.bin Signed-off-by: Thomas Renninger <trenn@suse.de> Acked-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1384274383-43510-1-git-send-email-trenn@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29sched, idle: Fix the idle polling state logicPeter Zijlstra
commit ea8117478918a4734586d35ff530721b682425be upstream. Mike reported that commit 7d1a9417 ("x86: Use generic idle loop") regressed several workloads and caused excessive reschedule interrupts. The patch in question failed to notice that the x86 code had an inverted sense of the polling state versus the new generic code (x86: default polling, generic: default !polling). Fix the two prominent x86 mwait based idle drivers and introduce a few new generic polling helpers (fixing the wrong smp_mb__after_clear_bit usage). Also switch the idle routines to using tif_need_resched() which is an immediate TIF_NEED_RESCHED test as opposed to need_resched which will end up being slightly different. Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: lenb@kernel.org Cc: tglx@linutronix.de Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13x86: Update UV3 hub revision IDRuss Anderson
commit dd3c9c4b603c664fedc12facf180db0f1794aafe upstream. The UV3 hub revision ID is different than expected. The first revision was supposed to start at 1 but instead will start at 0. Signed-off-by: Russ Anderson <rja@sgi.com> Link: http://lkml.kernel.org/r/20131014161733.GA6274@sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-18x86: avoid remapping data in parse_setup_data()Linn Crosetto
commit 30e46b574a1db7d14404e52dca8e1aa5f5155fd2 upstream. Type SETUP_PCI, added by setup_efi_pci(), may advertise a ROM size larger than early_memremap() is able to handle, which is currently limited to 256kB. If this occurs it leads to a NULL dereference in parse_setup_data(). To avoid this, remap the setup_data header and allow parsing functions for individual types to handle their own data remapping. Signed-off-by: Linn Crosetto <linn@hp.com> Link: http://lkml.kernel.org/r/1376430401-67445-1-git-send-email-linn@hp.com Acked-by: Yinghai Lu <yinghai@kernel.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-18compiler/gcc4: Add quirk for 'asm goto' miscompilation bugIngo Molnar
commit 3f0116c3238a96bc18ad4b4acefe4e7be32fa861 upstream. Fengguang Wu, Oleg Nesterov and Peter Zijlstra tracked down a kernel crash to a GCC bug: GCC miscompiles certain 'asm goto' constructs, as outlined here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 Implement a workaround suggested by Jakub Jelinek. Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Suggested-by: Jakub Jelinek <jakub@redhat.com> Reviewed-by: Richard Henderson <rth@twiddle.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20131015062351.GA4666@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05x86, efi: Don't map Boot Services on i386Josh Boyer
commit 700870119f49084da004ab588ea2b799689efaf7 upstream. Add patch to fix 32bit EFI service mapping (rhbz 726701) Multiple people are reporting hitting the following WARNING on i386, WARNING: at arch/x86/mm/ioremap.c:102 __ioremap_caller+0x3d3/0x440() Modules linked in: Pid: 0, comm: swapper Not tainted 3.9.0-rc7+ #95 Call Trace: [<c102b6af>] warn_slowpath_common+0x5f/0x80 [<c1023fb3>] ? __ioremap_caller+0x3d3/0x440 [<c1023fb3>] ? __ioremap_caller+0x3d3/0x440 [<c102b6ed>] warn_slowpath_null+0x1d/0x20 [<c1023fb3>] __ioremap_caller+0x3d3/0x440 [<c106007b>] ? get_usage_chars+0xfb/0x110 [<c102d937>] ? vprintk_emit+0x147/0x480 [<c1418593>] ? efi_enter_virtual_mode+0x1e4/0x3de [<c102406a>] ioremap_cache+0x1a/0x20 [<c1418593>] ? efi_enter_virtual_mode+0x1e4/0x3de [<c1418593>] efi_enter_virtual_mode+0x1e4/0x3de [<c1407984>] start_kernel+0x286/0x2f4 [<c1407535>] ? repair_env_string+0x51/0x51 [<c1407362>] i386_start_kernel+0x12c/0x12f Due to the workaround described in commit 916f676f8 ("x86, efi: Retain boot service code until after switching to virtual mode") EFI Boot Service regions are mapped for a period during boot. Unfortunately, with the limited size of the i386 direct kernel map it's possible that some of the Boot Service regions will not be directly accessible, which causes them to be ioremap()'d, triggering the above warning as the regions are marked as E820_RAM in the e820 memmap. There are currently only two situations where we need to map EFI Boot Service regions, 1. To workaround the firmware bug described in 916f676f8 2. To access the ACPI BGRT image but since we haven't seen an i386 implementation that requires either, this simple fix should suffice for now. [ Added to changelog - Matt ] Reported-by: Bryan O'Donoghue <bryan.odonoghue.lkml@nexus-software.ie> Acked-by: Tom Zanussi <tom.zanussi@intel.com> Acked-by: Darren Hart <dvhart@linux.intel.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05x86/reboot: Add quirk to make Dell C6100 use reboot=pci automaticallyMasoud Sharbiani
commit 4f0acd31c31f03ba42494c8baf6c0465150e2621 upstream. Dell PowerEdge C6100 machines fail to completely reboot about 20% of the time. Signed-off-by: Masoud Sharbiani <msharbiani@twitter.com> Signed-off-by: Vinson Lee <vlee@twitter.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Link: http://lkml.kernel.org/r/1379717947-18042-1-git-send-email-vlee@freedesktop.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26sched/x86: Optimize switch_mm() for multi-threaded workloadsRik van Riel
commit 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 upstream. Dick Fowles, Don Zickus and Joe Mario have been working on improvements to perf, and noticed heavy cache line contention on the mm_cpumask, running linpack on a 60 core / 120 thread system. The cause turned out to be unnecessary atomic accesses to the mm_cpumask. When in lazy TLB mode, the CPU is only removed from the mm_cpumask if there is a TLB flush event. Most of the time, no such TLB flush happens, and the kernel skips the TLB reload. It can also skip the atomic memory set & test. Here is a summary of Joe's test results: * The __schedule function dropped from 24% of all program cycles down to 5.5%. * The cacheline contention/hotness for accesses to that bitmask went from being the 1st/2nd hottest - down to the 84th hottest (0.3% of all shared misses which is now quite cold) * The average load latency for the bit-test-n-set instruction in __schedule dropped from 10k-15k cycles down to an average of 600 cycles. * The linpack program results improved from 133 GFlops to 144 GFlops. Peak GFlops rose from 133 to 153. Reported-by: Don Zickus <dzickus@redhat.com> Reported-by: Joe Mario <jmario@redhat.com> Tested-by: Joe Mario <jmario@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Paul Turner <pjt@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com [ Made the comments consistent around the modified code. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26x86/mce: Pay no attention to 'F' bit in MCACOD when parsing 'UC' errorsTony Luck
commit 0ca06c0857aee11911f91621db14498496f2c2cd upstream. The 0x1000 bit of the MCACOD field of machine check MCi_STATUS registers is only defined for corrected errors (where it means that hardware may be filtering errors see SDM section 15.9.2.1). For uncorrected errors it may, or may not be set - so we should mask it out when checking for the architecturaly defined recoverable error signatures (see SDM 15.9.3.1 and 15.9.3.2) Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26x86, amd_nb: Clarify F15h, model 30h GART and L3 supportAravind Gopalakrishnan
commit 7d64ac6422092adbbdaa279ab32f9d4c90a84558 upstream. F15h, models 0x30 and later don't have a GART. Note that. Also check CPUID leaf 0x80000006 for L3 prescence because there are models which don't sport an L3 cache. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> [ Boris: rewrite commit message, cleanup comments. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26Introduce [compat_]save_altstack_ex() to unbreak x86 SMAPAl Viro
commit bd1c149aa9915b9abb6d83d0f01dfd2ace0680b5 upstream. For performance reasons, when SMAP is in use, SMAP is left open for an entire put_user_try { ... } put_user_catch(); block, however, calling __put_user() in the middle of that block will close SMAP as the STAC..CLAC constructs intentionally do not nest. Furthermore, using __put_user() rather than put_user_ex() here is bad for performance. Thus, introduce new [compat_]save_altstack_ex() helpers that replace __[compat_]save_altstack() for x86, being currently the only architecture which supports put_user_try { ... } put_user_catch(). Reported-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/n/tip-es5p6y64if71k8p5u08agv9n@git.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26x86, smap: Handle csum_partial_copy_*_user()H. Peter Anvin
commit 7263dda41b5a28ae6566fd126d9b06ada73dd721 upstream. Add SMAP annotations to csum_partial_copy_to/from_user(). These functions legitimately access user space and thus need to set the AC flag. TODO: add explicit checks that the side with the kernel space pointer really points into kernel space. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/n/tip-2aps0u00eer658fd5xyanan7@git.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-14crypto: xor - Check for osxsave as well as avx in crypto/xorJohn Haxby
commit edb6f29464afc65fc73767540b854abf63ae7144 upstream. This affects xen pv guests with sufficiently old versions of xen and sufficiently new hardware. On such a system, a guest with a btrfs root won't even boot. Signed-off-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: Michael Marineau <michael.marineau@coreos.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-07x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAMYinghai Lu
commit 527bf129f9a780e11b251cf2467dc30118a57d16 upstream. Dave Hansen reported that systems between 500G and 600G RAM crash early if DEBUG_PAGEALLOC is selected. > [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff] > [ 0.000000] [mem 0x00000000-0x000fffff] page 4k > [ 0.000000] BRK [0x02086000, 0x02086fff] PGTABLE > [ 0.000000] BRK [0x02087000, 0x02087fff] PGTABLE > [ 0.000000] BRK [0x02088000, 0x02088fff] PGTABLE > [ 0.000000] init_memory_mapping: [mem 0xe80ee00000-0xe80effffff] > [ 0.000000] [mem 0xe80ee00000-0xe80effffff] page 4k > [ 0.000000] BRK [0x02089000, 0x02089fff] PGTABLE > [ 0.000000] BRK [0x0208a000, 0x0208afff] PGTABLE > [ 0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory It turns out that we missed increasing needed pages in BRK to mapping initial 2M and [0,1M) when we switched to use the #PF handler to set memory mappings: > commit 8170e6bed465b4b0c7687f93e9948aca4358a33b > Author: H. Peter Anvin <hpa@zytor.com> > Date: Thu Jan 24 12:19:52 2013 -0800 > > x86, 64bit: Use a #PF handler to materialize early mappings on demand Before that, we had the maping from [0,512M) in head_64.S, and we can spare two pages [0-1M). After that change, we can not reuse pages anymore. When we have more than 512M ram, we need an extra page for pgd page with [512G, 1024g). Increase pages in BRK for page table to solve the boot crash. Reported-by: Dave Hansen <dave.hansen@intel.com> Bisected-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1376351004-4015-1-git-send-email-yinghai@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29x86/xen: do not identity map UNUSABLE regions in the machine E820David Vrabel
commit 3bc38cbceb85881a8eb789ee1aa56678038b1909 upstream. If there are UNUSABLE regions in the machine memory map, dom0 will attempt to map them 1:1 which is not permitted by Xen and the kernel will crash. There isn't anything interesting in the UNUSABLE region that the dom0 kernel needs access to so we can avoid making the 1:1 mapping and treat it as RAM. We only do this for dom0, as that is where tboot case shows up. A PV domU could have an UNUSABLE region in its pseudo-physical map and would need to be handled in another patch. This fixes a boot failure on hosts with tboot. tboot marks a region in the e820 map as unusable and the dom0 kernel would attempt to map this region and Xen does not permit unusable regions to be mapped by guests. (XEN) 0000000000000000 - 0000000000060000 (usable) (XEN) 0000000000060000 - 0000000000068000 (reserved) (XEN) 0000000000068000 - 000000000009e000 (usable) (XEN) 0000000000100000 - 0000000000800000 (usable) (XEN) 0000000000800000 - 0000000000972000 (unusable) tboot marked this region as unusable. (XEN) 0000000000972000 - 00000000cf200000 (usable) (XEN) 00000000cf200000 - 00000000cf38f000 (reserved) (XEN) 00000000cf38f000 - 00000000cf3ce000 (ACPI data) (XEN) 00000000cf3ce000 - 00000000d0000000 (reserved) (XEN) 00000000e0000000 - 00000000f0000000 (reserved) (XEN) 00000000fe000000 - 0000000100000000 (reserved) (XEN) 0000000100000000 - 0000000630000000 (usable) Signed-off-by: David Vrabel <david.vrabel@citrix.com> [v1: Altered the patch and description with domU's with UNUSABLE regions] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29x86 get_unmapped_area: Access mmap_legacy_base through mm_struct memberRadu Caragea
commit 41aacc1eea645c99edbe8fbcf78a97dc9b862adc upstream. This is the updated version of df54d6fa5427 ("x86 get_unmapped_area(): use proper mmap base for bottom-up direction") that only randomizes the mmap base address once. Signed-off-by: Radu Caragea <sinaelgl@gmail.com> Reported-and-tested-by: Jeff Shorey <shoreyjeff@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Adrian Sendroiu <molecula2788@gmail.com> Cc: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29Revert "x86 get_unmapped_area(): use proper mmap base for bottom-up direction"Linus Torvalds
commit 5ea80f76a56605a190a7ea16846c82aa63dbd0aa upstream. This reverts commit df54d6fa54275ce59660453e29d1228c2b45a826. The commit isn't necessarily wrong, but because it recalculates the random mmap_base every time, it seems to confuse user memory allocators that expect contiguous mmap allocations even when the mmap address isn't specified. In particular, the MATLAB Java runtime seems to be unhappy. See https://bugzilla.kernel.org/show_bug.cgi?id=60774 So we'll want to apply the random offset only once, and Radu has a patch for that. Revert this older commit in order to apply the other one. Reported-by: Jeff Shorey <shoreyjeff@gmail.com> Cc: Radu Caragea <sinaelgl@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29x86: Don't clear olpc_ofw_header when sentinel is detectedDaniel Drake
commit d55e37bb0f51316e552376ddc0a3fff34ca7108b upstream. OpenFirmware wasn't quite following the protocol described in boot.txt and the kernel has detected this through use of the sentinel value in boot_params. OFW does zero out almost all of the stuff that it should do, but not the sentinel. This causes the kernel to clear olpc_ofw_header, which breaks x86 OLPC support. OpenFirmware has now been fixed. However, it would be nice if we could maintain Linux compatibility with old firmware versions. To do that, we just have to avoid zeroing out olpc_ofw_header. OFW does not write to any other parts of the header that are being zapped by the sentinel-detection code, and all users of olpc_ofw_header are somewhat protected through checking for the OLPC_OFW_SIG magic value before using it. So this should not cause any problems for anyone. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/20130809221420.618E6FAB03@dev.laptop.org Acked-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29xen/smp: initialize IPI vectors before marking CPU onlineChuck Anderson
commit fc78d343fa74514f6fd117b5ef4cd27e4ac30236 upstream. An older PVHVM guest (v3.0 based) crashed during vCPU hot-plug with: kernel BUG at drivers/xen/events.c:1328! RCU has detected that a CPU has not entered a quiescent state within the grace period. It needs to send the CPU a reschedule IPI if it is not offline. rcu_implicit_offline_qs() does this check: /* * If the CPU is offline, it is in a quiescent state. We can * trust its state not to change because interrupts are disabled. */ if (cpu_is_offline(rdp->cpu)) { rdp->offline_fqs++; return 1; } Else the CPU is online. Send it a reschedule IPI. The CPU is in the middle of being hot-plugged and has been marked online (!cpu_is_offline()). See start_secondary(): set_cpu_online(smp_processor_id(), true); ... per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; start_secondary() then waits for the CPU bringing up the hot-plugged CPU to mark it as active: /* * Wait until the cpu which brought this one up marked it * online before enabling interrupts. If we don't do that then * we can end up waking up the softirq thread before this cpu * reached the active state, which makes the scheduler unhappy * and schedule the softirq thread on the wrong cpu. This is * only observable with forced threaded interrupts, but in * theory it could also happen w/o them. It's just way harder * to achieve. */ while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask)) cpu_relax(); /* enable local interrupts */ local_irq_enable(); The CPU being hot-plugged will be marked active after it has been fully initialized by the CPU managing the hot-plug. In the Xen PVHVM case xen_smp_intr_init() is called to set up the hot-plugged vCPU's XEN_RESCHEDULE_VECTOR. The hot-plugging CPU is marked online, not marked active and does not have its IPI vectors set up. rcu_implicit_offline_qs() sees the hot-plugging cpu is !cpu_is_offline() and tries to send it a reschedule IPI: This will lead to: kernel BUG at drivers/xen/events.c:1328! xen_send_IPI_one() xen_smp_send_reschedule() rcu_implicit_offline_qs() rcu_implicit_dynticks_qs() force_qs_rnp() force_quiescent_state() __rcu_process_callbacks() rcu_process_callbacks() __do_softirq() call_softirq() do_softirq() irq_exit() xen_evtchn_do_upcall() because xen_send_IPI_one() will attempt to use an uninitialized IRQ for the XEN_RESCHEDULE_VECTOR. There is at least one other place that has caused the same crash: xen_smp_send_reschedule() wake_up_idle_cpu() add_timer_on() clocksource_watchdog() call_timer_fn() run_timer_softirq() __do_softirq() call_softirq() do_softirq() irq_exit() xen_evtchn_do_upcall() xen_hvm_callback_vector() clocksource_watchdog() uses cpu_online_mask to pick the next CPU to handle a watchdog timer: /* * Cycle through CPUs to check if the CPUs stay synchronized * to each other. */ next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask); if (next_cpu >= nr_cpu_ids) next_cpu = cpumask_first(cpu_online_mask); watchdog_timer.expires += WATCHDOG_INTERVAL; add_timer_on(&watchdog_timer, next_cpu); This resulted in an attempt to send an IPI to a hot-plugging CPU that had not initialized its reschedule vector. One option would be to make the RCU code check to not check for CPU offline but for CPU active. As becoming active is done after a CPU is online (in older kernels). But Srivatsa pointed out that "the cpu_active vs cpu_online ordering has been completely reworked - in the online path, cpu_active is set *before* cpu_online, and also, in the cpu offline path, the cpu_active bit is reset in the CPU_DYING notification instead of CPU_DOWN_PREPARE." Drilling in this the bring-up path: "[brought up CPU].. send out a CPU_STARTING notification, and in response to that, the scheduler sets the CPU in the cpu_active_mask. Again, this mask is better left to the scheduler alone, since it has the intelligence to use it judiciously." The conclusion was that: " 1. At the IPI sender side: It is incorrect to send an IPI to an offline CPU (cpu not present in the cpu_online_mask). There are numerous places where we check this and warn/complain. 2. At the IPI receiver side: It is incorrect to let the world know of our presence (by setting ourselves in global bitmasks) until our initialization steps are complete to such an extent that we can handle the consequences (such as receiving interrupts without crashing the sender etc.) " (from Srivatsa) As the native code enables the interrupts at some point we need to be able to service them. In other words a CPU must have valid IPI vectors if it has been marked online. It doesn't need to handle the IPI (interrupts may be disabled) but needs to have valid IPI vectors because another CPU may find it in cpu_online_mask and attempt to send it an IPI. This patch will change the order of the Xen vCPU bring-up functions so that Xen vectors have been set up before start_secondary() is called. It also will not continue to bring up a Xen vCPU if xen_smp_intr_init() fails to initialize it. Orabug 13823853 Signed-off-by Chuck Anderson <chuck.anderson@oracle.com> Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-20x86 get_unmapped_area(): use proper mmap base for bottom-up directionRadu Caragea
commit df54d6fa54275ce59660453e29d1228c2b45a826 upstream. When the stack is set to unlimited, the bottomup direction is used for mmap-ings but the mmap_base is not used and thus effectively renders ASLR for mmapings along with PIE useless. Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Adrian Sendroiu <molecula2788@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-20perf/x86: Fix intel QPI uncore event definitionsVince Weaver
commit c9601247f8f3fdc18aed7ed7e490e8dfcd07f122 upstream. John McCalpin reports that the "drs_data" and "ncb_data" QPI uncore events are missing the "extra bit" and always return zero values unless the bit is properly set. More details from him: According to the Xeon E5-2600 Product Family Uncore Performance Monitoring Guide, Table 2-94, about 1/2 of the QPI Link Layer events (including the ones that "perf" calls "drs_data" and "ncb_data") require that the "extra bit" be set. This was confusing for a while -- a note at the bottom of page 94 says that the "extra bit" is bit 16 of the control register. Unfortunately, Table 2-86 clearly says that bit 16 is reserved and must be zero. Looking around a bit, I found that bit 21 appears to be the correct "extra bit", and further investigation shows that "perf" actually agrees with me: [root@c560-003.stampede]# cat /sys/bus/event_source/devices/uncore_qpi_0/format/event config:0-7,21 So the command # perf -e "uncore_qpi_0/event=drs_data/" Is the same as # perf -e "uncore_qpi_0/event=0x02,umask=0x08/" While it should be # perf -e "uncore_qpi_0/event=0x102,umask=0x08/" I confirmed that this last version gives results that agree with the amount of data that I expected the STREAM benchmark to move across the QPI link in the second (cross-chip) test of the original script. Reported-by: John McCalpin <mccalpin@tacc.utexas.edu> Signed-off-by: Vince Weaver <vincent.weaver@maine.edu> Cc: zheng.z.yan@intel.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Paul Mackerras <paulus@samba.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1308021037280.26119@vincent-weaver-1.um.maine.edu Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-11x86/iommu/vt-d: Expand interrupt remapping quirk to cover x58 chipsetNeil Horman
commit 803075dba31c17af110e1d9a915fe7262165b213 upstream. Recently we added an early quirk to detect 5500/5520 chipsets with early revisions that had problems with irq draining with interrupt remapping enabled: commit 03bbcb2e7e292838bb0244f5a7816d194c911d62 Author: Neil Horman <nhorman@tuxdriver.com> Date: Tue Apr 16 16:38:32 2013 -0400 iommu/vt-d: add quirk for broken interrupt remapping on 55XX chipsets It turns out this same problem is present in the intel X58 chipset as well. See errata 69 here: http://www.intel.com/content/www/us/en/chipsets/x58-express-specification-update.html This patch extends the pci early quirk so that the chip devices/revisions specified in the above update are also covered in the same way: Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Donald Dutile <ddutile@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Malcolm Crossley <malcolm.crossley@citrix.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/r/1374059639-8631-1-git-send-email-nhorman@tuxdriver.com [ Small edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-11x86, fpu: correct the asm constraints for fxsave, unbreak mxcsr.dazH.J. Lu
commit eaa5a990191d204ba0f9d35dbe5505ec2cdd1460 upstream. GCC will optimize mxcsr_feature_mask_init in arch/x86/kernel/i387.c: memset(&fx_scratch, 0, sizeof(struct i387_fxsave_struct)); asm volatile("fxsave %0" : : "m" (fx_scratch)); mask = fx_scratch.mxcsr_mask; if (mask == 0) mask = 0x0000ffbf; to memset(&fx_scratch, 0, sizeof(struct i387_fxsave_struct)); asm volatile("fxsave %0" : : "m" (fx_scratch)); mask = 0x0000ffbf; since asm statement doesn’t say it will update fx_scratch. As the result, the DAZ bit will be cleared. This patch fixes it. This bug dates back to at least kernel 2.6.12. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04x86: Fix /proc/mtrr with base/size more than 44bitsYinghai Lu
commit d5c78673b1b28467354c2c30c3d4f003666ff385 upstream. On one sytem that mtrr range is more then 44bits, in dmesg we have [ 0.000000] MTRR default type: write-back [ 0.000000] MTRR fixed ranges enabled: [ 0.000000] 00000-9FFFF write-back [ 0.000000] A0000-BFFFF uncachable [ 0.000000] C0000-DFFFF write-through [ 0.000000] E0000-FFFFF write-protect [ 0.000000] MTRR variable ranges enabled: [ 0.000000] 0 [000080000000-0000FFFFFFFF] mask 3FFF80000000 uncachable [ 0.000000] 1 [380000000000-38FFFFFFFFFF] mask 3F0000000000 uncachable [ 0.000000] 2 [000099000000-000099FFFFFF] mask 3FFFFF000000 write-through [ 0.000000] 3 [00009A000000-00009AFFFFFF] mask 3FFFFF000000 write-through [ 0.000000] 4 [381FFA000000-381FFBFFFFFF] mask 3FFFFE000000 write-through [ 0.000000] 5 [381FFC000000-381FFC0FFFFF] mask 3FFFFFF00000 write-through [ 0.000000] 6 [0000AD000000-0000ADFFFFFF] mask 3FFFFF000000 write-through [ 0.000000] 7 [0000BD000000-0000BDFFFFFF] mask 3FFFFF000000 write-through [ 0.000000] 8 disabled [ 0.000000] 9 disabled but /proc/mtrr report wrong: reg00: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable reg01: base=0x80000000000 (8388608MB), size=1048576MB, count=1: uncachable reg02: base=0x099000000 ( 2448MB), size= 16MB, count=1: write-through reg03: base=0x09a000000 ( 2464MB), size= 16MB, count=1: write-through reg04: base=0x81ffa000000 (8519584MB), size= 32MB, count=1: write-through reg05: base=0x81ffc000000 (8519616MB), size= 1MB, count=1: write-through reg06: base=0x0ad000000 ( 2768MB), size= 16MB, count=1: write-through reg07: base=0x0bd000000 ( 3024MB), size= 16MB, count=1: write-through reg08: base=0x09b000000 ( 2480MB), size= 16MB, count=1: write-combining so bit 44 and bit 45 get cut off. We have problems in arch/x86/kernel/cpu/mtrr/generic.c::generic_get_mtrr(). 1. for base, we miss cast base_lo to 64bit before shifting. Fix that by adding u64 casting. 2. for size, it only can handle 44 bits aka 32bits + page_shift Fix that with 64bit mask instead of 32bit mask_lo, then range could be more than 44bits. At the same time, we need to update size_or_mask for old cpus that does support cpuid 0x80000008 to get phys_addr. Need to set high 32bits to all 1s, otherwise will not get correct size for them. Also fix mtrr_add_page: it should check base and (base + size - 1) instead of base and size, as base and size could be small but base + size could bigger enough to be out of boundary. We can use boot_cpu_data.x86_phys_bits directly to avoid size_or_mask. So When are we going to have size more than 44bits? that is 16TiB. after patch we have right ouput: reg00: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable reg01: base=0x380000000000 (58720256MB), size=1048576MB, count=1: uncachable reg02: base=0x099000000 ( 2448MB), size= 16MB, count=1: write-through reg03: base=0x09a000000 ( 2464MB), size= 16MB, count=1: write-through reg04: base=0x381ffa000000 (58851232MB), size= 32MB, count=1: write-through reg05: base=0x381ffc000000 (58851264MB), size= 1MB, count=1: write-through reg06: base=0x0ad000000 ( 2768MB), size= 16MB, count=1: write-through reg07: base=0x0bd000000 ( 3024MB), size= 16MB, count=1: write-through reg08: base=0x09b000000 ( 2480MB), size= 16MB, count=1: write-combining -v2: simply checking in mtrr_add_page according to hpa. [ hpa: This probably wants to go into -stable only after having sat in mainline for a bit. It is not a regression. ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1371162815-29931-1-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04x86: make sure IDT is page alignedKees Cook
based on 4df05f361937ee86e5a8c9ead8aeb6a19ea9b7d7 upstream. Since the IDT is referenced from a fixmap, make sure it is page aligned. This avoids the risk of the IDT ever being moved in the bss and having the mapping be offset, resulting in calling incorrect handlers. In the current upstream kernel this is not a manifested bug, but heavily patched kernels (such as those using the PaX patch series) did encounter this bug. Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: PaX Team <pageexec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04x86, suspend: Handle CPUs which fail to #GP on RDMSRH. Peter Anvin
commit 5ff560fd48d5b3d82fa0c3aff625c9da1a301911 upstream. There are CPUs which have errata causing RDMSR of a nonexistent MSR to not fault. We would then try to WRMSR to restore the value of that MSR, causing a crash. Specifically, some Pentium M variants would have this problem trying to save and restore the non-existent EFER, causing a crash on resume. Work around this by making sure we can write back the result at suspend time. Huge thanks to Christian Sünkenberg for finding the offending erratum that finally deciphered the mystery. Reported-and-tested-by: Johan Heinrich <onny@project-insanity.org> Debugged-by: Christian Sünkenberg <christian.suenkenberg@student.kit.edu> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Link: http://lkml.kernel.org/r/51DDC972.3010005@student.kit.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21xen/time: remove blocked time accounting from xen "clockchip"Laszlo Ersek
commit 0b0c002c340e78173789f8afaa508070d838cf3d upstream. ... because the "clock_event_device framework" already accounts for idle time through the "event_handler" function pointer in xen_timer_interrupt(). The patch is intended as the completion of [1]. It should fix the double idle times seen in PV guests' /proc/stat [2]. It should be orthogonal to stolen time accounting (the removed code seems to be isolated). The approach may be completely misguided. [1] https://lkml.org/lkml/2011/10/6/10 [2] http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01068.html John took the time to retest this patch on top of v3.10 and reported: "idle time is correctly incremented for pv and hvm for the normal case, nohz=off and nohz=idle." so lets put this patch in. Signed-off-by: Laszlo Ersek <lersek@redhat.com> Signed-off-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21x86, efi: retry ExitBootServices() on failureZach Bobroff
commit d3768d885c6ccbf8a137276843177d76c49033a7 upstream. ExitBootServices is absolutely supposed to return a failure if any ExitBootServices event handler changes the memory map. Basically the get_map loop should run again if ExitBootServices returns an error the first time. I would say it would be fair that if ExitBootServices gives an error the second time then Linux would be fine in returning control back to BIOS. The second change is the following line: again: size += sizeof(*mem_map) * 2; Originally you were incrementing it by the size of one memory map entry. The issue here is all related to the low_alloc routine you are using. In this routine you are making allocations to get the memory map itself. Doing this allocation or allocations can affect the memory map by more than one record. [ mfleming - changelog, code style ] Signed-off-by: Zach Bobroff <zacharyb@ami.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-13KVM: VMX: mark unusable segment as nonpresentGleb Natapov
commit 03617c188f41eeeb4223c919ee7e66e5a114f2c6 upstream. Some userspaces do not preserve unusable property. Since usable segment has to be present according to VMX spec we can use present property to amend userspace bug by making unusable segment always nonpresent. vmx_segment_access_rights() already marks nonpresent segment as unusable. Reported-by: Stefan Pietsch <stefan.pietsch@lsexperts.de> Tested-by: Stefan Pietsch <stefan.pietsch@lsexperts.de> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-26Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Three small fixlets" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot() hw_breakpoint: Fix cpu check in task_bp_pinned(cpu) kprobes: Fix arch_prepare_kprobe to handle copy insn failures