Your Intel x86 CPU is Deeply Flawed (Meltdown/Spectre)


Comments

  • ramnet Member, Host Rep
    edited January 2018

    Some are suggesting to enable First-Party Isolation in Firefox and Site Isolation in Chrome to help mitigate this attack against web browsers.

    Don't forget, the really big exploit here is JavaScript being able to dump out the entire memory of your web browser to steal passwords, cookies, and other stuff from the browser cache. That is a much bigger issue than same-node VPS attacks.

    For Firefox, go to: about:config?filter=privacy.firstparty.isolate

    For Chrome, go to: chrome://flags/#enable-site-per-process

    Not a perfect fix, but better than nothing at this time.
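    For what it's worth, the same settings can also be reached from a shell; a rough sketch for a Linux box (the binary names and exact switch behaviour are assumptions on my part, so double-check against your install):

    $ google-chrome --site-per-process   # command-line counterpart of the chrome://flags entry above
    $ firefox about:config               # then toggle privacy.firstparty.isolate to true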

    Thanked by aeba
  • So it is not possible to just disable speculative execution and branch prediction on modern CPUs?

  • Get patching, motherfuckers.

  • Get patching, motherfuckers.

    No.

  • @rds100 said:
    So it is not possible to just disable speculative execution and branch prediction on modern CPUs?

    If you could disable it, you would probably get performance equal to or worse than a similarly clocked old P4.

  • @Aidan said:

    Get patching, motherfuckers.

    No.

    Then reap the whirlwind, bitch.

    @raindog308 request thread title change to ‘All your CPU are belong to us’.

    Thanked by Aidan
  • perennate Member, Host Rep
    edited January 2018

    Darwin said: If you could disable it, you would probably get performance equal to or worse than a similarly clocked old P4.

    Actually, branch prediction has apparently been around since the 1950s, and in most processors since the 1990s. Even the Pentium 1 used branch prediction. [1]

    But yeah, the performance would probably not be that bad. It'd depend on the application, though: if your code has no branches, it should run at the same speed; if it mostly consists of a loop with a very short body, it will be very slow. Programs generally spend most of their time in loops, though.

    Branch prediction actually comes up in a surprisingly large number of performance analyses. For example, in many SQL databases, a query with selectivity near 0% or 100% will execute faster than the same query with selectivity near 50% (selectivity being the percentage of rows that match the query predicate). The difference is only noticeable in some cases, and only in high-performance databases; e.g. if the query ever needs to touch the disk, the disk latency will likely dwarf the CPU time spent identifying matching rows.

    [1] https://en.wikipedia.org/wiki/Branch_predictor#History

  • MasonR Community Contributor

    @Nekki said: @raindog308 request thread title change to ‘All your CPU are belong to us’.

    'You must construct additional page tables!'

    Thanked by vimalware
  • perennate Member, Host Rep
    edited January 2018

    perennate said: Branch prediction actually comes up in a surprisingly large number of performance analyses. For example, in many SQL databases, a query with selectivity near 0% or 100% will execute faster than the same query with selectivity near 50% (selectivity being the percentage of rows that match the query predicate). The difference is only noticeable in some cases, and only in high-performance databases; e.g. if the query ever needs to touch the disk, the disk latency will likely dwarf the CPU time spent identifying matching rows.

    Here's a toy example: https://pastebin.com/gjg9bFpJ

    This shows how many seconds were spent in a short loop that takes an action with a certain probability.

    $ go run test.go -selectivity 0.5
    0.5970034390000001
    $ go run test.go -selectivity 0.01
    0.532437771
    $ go run test.go -selectivity 0.99
    0.5557861350000001
    

    It becomes much more apparent if we make the loop even simpler (move the rand.Float64 call out of the loop):

    $ go run test.go -selectivity 0.99
    0.052104648
    $ go run test.go -selectivity 0.5
    0.114960668
    $ go run test.go -selectivity 0.01
    0.050877901
    
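    The pastebin link above has the actual test.go; for readers who don't want to click through, here is a minimal sketch of the same idea (my own reconstruction, not the original code; it corresponds roughly to the "simpler loop" variant, since the random decisions are generated before the timed loop):

    package main

    import (
        "flag"
        "fmt"
        "math/rand"
        "time"
    )

    func main() {
        selectivity := flag.Float64("selectivity", 0.5, "probability that the branch is taken")
        flag.Parse()

        const n = 50000000

        // Pre-generate the random decisions so the timed loop only pays for the
        // branch itself, not for rand.Float64 (matching the "simpler loop" runs above).
        taken := make([]bool, n)
        for i := range taken {
            taken[i] = rand.Float64() < *selectivity
        }

        // Near 0% or 100% selectivity this branch is easy to predict; near 50%
        // the predictor is wrong about half the time and the loop slows down.
        sum := 0
        start := time.Now()
        for i := 0; i < n; i++ {
            if taken[i] {
                sum++
            }
        }
        fmt.Println(time.Since(start).Seconds(), sum)
    }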
  • LjL Member

    My host is rebooting nodes in small batches for now, and "seeing what happens". This is also due to changing information they say they are receiving about which platforms may or may not be affected (they have Intel, ARMHF, and ARM64 hosts).

    I like that so far, they are keeping a relatively detailed log of their findings: https://status.online.net/index.php?do=details&task_id=1116 — although my servers there are far from critical and I don't care to know exact downtimes, it's still kinda useful information considering that official disclosures about this set of CPU issues have seemed... somewhat suboptimal.

  • perennate Member, Host Rep

    rds100 said: So it is not possible to just disable speculative execution and branch prediction on modern CPUs?

    Someone on Hacker News said "on pre-Zen AMD there is also a chicken bit to disable indirect branch prediction" (with new microcode). I think this would only fix part of the attack? But I'm not sure.

    https://news.ycombinator.com/item?id=16068840

  • @ramnet said:
    Some are suggesting to enable First-Party Isolation in Firefox and Site Isolation in Chrome to help mitigate this attack against web browsers.

    Don't forget, the really big exploit here is JavaScript being able to dump out the entire memory of your web browser to steal passwords, cookies, and other stuff from the browser cache. That is a much bigger issue than same-node VPS attacks.

    Good remark. Just digest that: some JavaScript interpreters (and JITs) are "better" tools, and more dangerous, than the interpreter in the Linux kernel!

    Which also means that all those scriptable servers out there (about 143% of all web servers, it seems) are invitations to hackers - and known to be quite hackable. Cheerio!

    @rds100 said:
    So it is not possible to just disable speculative execution and branch prediction on modern CPUs?

    Yeah, right. But then you should grab your old drum back from the garage because it's going to allow for faster communication than an x86(-64) processor.

    @perennate said:
    Actually, branch prediction has apparently been around since the 1950s, and in most processors since the 1990s. Even the Pentium 1 used branch prediction. [1]

    Branch prediction can mean a lot of things. Back then it meant guessing which branch was more likely. Later came speculative evaluation (note the term not being "execution"), and so on.

    Those were quite different from what we're talking about here.

    Side note: one of the major problems is complexity. The complexity of modern x86 is insanely high, and to make it funnier (even more complex), very complex relations and dependencies also exist in the time domain.

    For those interested: that's similar to what we see in software design (I mean "design" as in "engineers at work", not "funny hacking") and verification (in all steps, from design to binary). Code per se is quite simple (Frama-C, for instance, has done that for years). Where it gets ugly is a) memory (dynamically allocated linked lists, anyone?) and b) time, both as in real-time and as in time distance and interdependence. Tools for both of these are in relatively early phases and mostly used and worked on in academia (with lots of help from the math department).

    TL;DR: a modern processor design is very hard to fully and truly test, particularly considering that many "important!" points weren't even seen or considered when it was designed.

    Thanked by Aidan
  • perennate Member, Host Rep
    edited January 2018

    bsdguy said: Branch prediction can mean a lot of things. Back then it meant guessing which branch was more likely. Later came speculative evaluation (note the term not being "execution"), and so on.

    I don't think there's any point in guessing which branch is more likely without performing speculative execution. The entire idea is to avoid leaving the pipeline empty after a branch. Anyway, I was referring to branch prediction with speculative execution being around since the 1950s:

    The IBM Stretch, designed in the late 1950s, pre-executes all unconditional branches and any conditional branches that depended on the index registers. For other conditional branches, the first two production models implemented predict untaken; subsequent models were changed to implement predictions based on the current values of the indicator bits (corresponding to today's condition codes). ... Misprediction recovery was provided by the lookahead unit on Stretch, and part of Stretch's reputation for less-than-stellar performance was blamed on the time required for misprediction recovery.

    The only fundamental difference is that modern branch prediction algorithms are much more complex, and so are able to achieve lower misprediction rates.

    Edit: to make it more clear, speculative execution depends on branch prediction -- the processor predicts which branch will be taken, and then uses that prediction to continue executing instructions following the branch. You can't have speculative execution without branch prediction. And similarly I don't think it makes sense to have branch prediction without speculative execution. So they are essentially two components of the same system, which is usually just called "branch prediction" (at least in computer architecture).

    https://www.quora.com/What-is-the-difference-between-branch-prediction-and-hardware-speculation

    Edit2: anyway, I don't know why you're getting so caught up on the naming :P I think we both understand the general procedure that the CPU uses.

  • @perennate said:

    The IBM Stretch, designed in the late 1950s, pre-executes all unconditional branches and any conditional branches that depended on the index registers...

    The 7030 wasn't a processor but a mainframe. Slight difference.

    Edit: to make it more clear, speculative execution depends on branch prediction -- the processor predicts which branch will be taken, and then uses that prediction to continue executing instructions following the branch. You can't have speculative execution without branch prediction. And similarly I don't think it makes sense to have branch prediction without speculative execution. So they are essentially two components of the same system, which is usually just called "branch prediction" (at least in computer architecture).

    Nope. Branch prediction without spec. exec. does make sense and is helpful; that's why it was done. Probably the most important reason being that the processor can preload memory - with memory reading being by far more relevant for performance than spec. exec. (which is just yet another, relatively small performance enhancement).

    That's why caches came up, and then multi-level caches. And that's why there is no simple microcode patch. Both disabling branch prediction and disabling the L1 cache are prohibitive in terms of performance; it would lead to people not buying new processors and/or not applying the microcode patch. The other reason is that even just the memory subsystem in a modern x86 (or ARM) is insanely complex, and to make it worse, it's condemned to stay compatible.

  • @perennate said:

    bsdguy said: Branch prediction can mean a lot of things. Back then it meant guessing which branch was more likely. Later came speculative evaluation (note the term not being "execution"), and so on.

    I don't think there's any point in guessing which branch is more likely without performing speculative execution.

    Well, I sure am no expert here, but I see value even in simple prediction, because it allows prefetching the (likely) target address without wasting cache on the (likely) losing branch.

    Edit: to make it more clear, speculative execution depends on branch prediction -- the processor predicts which branch will be taken, and then uses that prediction to continue executing instructions following the branch. You can't have speculative execution without branch prediction. And similarly I don't think it makes sense to have branch prediction without speculative execution. So they are essentially two components of the same system, which is usually just called "branch prediction" (at least in computer architecture).

    Well, kinda, since it's named speculative. But if you're not going to split hairs over this, and considering that most systems these days have multiple cores, I think just stupidly executing both branches and throwing away the results of the losing one once the real outcome is known could lead to an overall performance benefit, unless all your cores are at 100% usage.

  • perennate Member, Host Rep
    edited January 2018

    bsdguy said: Nope. Branch prediction without spec. exec. does make sense and is helpful; that's why it was done. Probably the most important reason being that the processor can preload memory - with memory reading being by far more relevant for performance than spec. exec. (which is just yet another, relatively small performance enhancement).

    That's why caches came up, and then multi-level caches. And that's why there is no simple microcode patch. Both disabling branch prediction and disabling the L1 cache are prohibitive in terms of performance; it would lead to people not buying new processors and/or not applying the microcode patch. The other reason is that even just the memory subsystem in a modern x86 (or ARM) is insanely complex, and to make it worse, it's condemned to stay compatible.

    I'm not sure why you're arguing about something so silly. As I said before, I think we both understand the general procedure that the CPU uses.

    If you want to continue arguing: in CPU architecture, predicting the branch + speculatively executing instructions after the branch is generally referred to as "branch prediction". Neither term is perfect -- branch prediction for the reason you mentioned, and speculative execution because it can refer to execution based on other types of predictions. I'd say it makes sense that branch prediction is the accepted term, because what you said about preloading memory is not useful in modern processors; the portion of the CPU pipeline for fetching instructions is far too small to make much of a difference.

    Thanked by vimalware
  • perennate Member, Host Rep
    edited January 2018

    mksh said: Well, kinda, since it's named speculative. But if you're not going to split hairs over this, and considering that most systems these days have multiple cores, I think just stupidly executing both branches and throwing away the results of the losing one once the real outcome is known could lead to an overall performance benefit, unless all your cores are at 100% usage.

    CPUs could theoretically do that, but they generally don't. This is because branch prediction is correct 99+% of the time; e.g. if you have a loop, you're more likely to continue the loop than to exit it. I think modern CPUs can actually take the loop indices into account somehow. With such high accuracy, spending cycles to execute the other branch would be slower. But yeah, executing both branches is a pretty interesting idea. And I agree that we are just splitting hairs over the name.

    Edit: actually it might be more like 95%, I'm not sure. Of course it's going to depend a lot on the application; e.g. for image/video processing I'd guess it's close to 100%, since you have lots of loops that execute for a huge number of iterations, but for a database it's maybe lower.

  • You can tell some people replying here are Googling that shit hard to be able to reply with some fancy shit. Hard work boi!

  • perennate Member, Host Rep
    edited January 2018

    Hxxx said: You can tell some people replying here are Googling that shit hard to be able to reply with some fancy shit. Hard work boi!

    Branch prediction is a fundamental concept in computer architecture. I think most people who majored in CS in college know what it is.

    But yeah, I googled things like how long it has been around, whether speculative execution is actually a better name (it's not), and some statistics. I tried to include links to stuff like this.

    Edit: the one thing I want to find the source for is the effect of branch prediction on query performance in databases based on the selectivity of the query, but I can't find the paper I remember it from. I guess there's this one, but it's actually studying branch prediction; the cool thing about the one I remember is that they weren't considering branch prediction in their analysis and were actually surprised by the results they got.

  • @perennate said:
    If you want to continue arguing: in CPU architecture, predicting the branch + speculatively executing instructions after the branch is generally referred to as "branch prediction". Neither term is perfect -- branch prediction for the reason you mentioned, and speculative execution because it can refer to execution based on other types of predictions. I'd say it makes sense that branch prediction is the accepted term, because what you said about preloading memory is not useful in modern processors; the portion of the CPU pipeline for fetching instructions is far too small to make much of a difference.

    I don't give a fuck about what Quora says. Spec exec is but a minor improvement. Memory preloading is, depending on cache depth, cache architecture, and prediction quality, a performance enhancement of up to about 100-fold.

    You brought up the 7030, i.e. the very beginning. Back then there were no caches, or only ridiculously small and one-dimensional ones, and having all the data available or not was a killer difference. The same is still largely true; it has just been perfected and extended.

    @perennate said:
    CPUs could theoretically do that, but they generally don't. This is because branch prediction is correct 99+% of the time; e.g. if you have a loop, ... executing both branches is a pretty interesting idea. And I agree that we are just splitting hairs over the name.

    Edit: actually it might be more like 95%,

    Bullshit. For a start, loops are child's play. Even in the first 8088, the CX register held everything that was needed to "predict" most loops, haha. And for those loops where that isn't true, the real problem is the underlying if-else branch problem - which, when predicted, can have pretty much any success rate between nada and 90+%.

    Moreover, that prediction usually doesn't come free, in particular when, as is often the case, indirect memory access is involved. It's often simply cheaper not to predict at all. In other cases, the likely winner's data is fetched into L1 and the likely loser's data is (prefetched too, but) left in L2 or even L3, each of which is much faster than a memory access.
    Also keep in mind that indirect memory access often has to make rather large and sometimes even irregular steps (think of a variable-size, dynamically allocated structure, which isn't rare at all).

  • perennate Member, Host Rep
    edited January 2018

    bsdguy said: I don't give a fuck about what Quora says. Spec exec is but a minor improvement. Memory preloading is, depending on cache depth, cache architecture, and prediction quality, a performance enhancement of up to about 100-fold.

    You brought up the 7030, i.e. the very beginning. Back then there were no caches, or only ridiculously small and one-dimensional ones, and having all the data available or not was a killer difference. The same is still largely true; it has just been perfected and extended.

    You're the one who started arguing about the name. I wasn't replying to you initially, I'm not sure why you felt the need to post paragraphs and paragraphs about this.

    Wikipedia and https://people.cs.clemson.edu/~mark/stretch.html both pretty clearly state that the IBM Stretch does speculatively execute instructions, which need to have their effects rolled back in the event of a misprediction; obviously it's going to be a much smaller number of instructions than in modern processors. I was just posting that initially to talk about the history of branch prediction because I thought it was interesting... it was not at all in reply to anything you said...

    bsdguy said: Bullshit. For a start, loops are child's play. Even in the first 8088, the CX register held everything that was needed to "predict" most loops, haha. And for those loops where that isn't true, the real problem is the underlying if-else branch problem - which, when predicted, can have pretty much any success rate between nada and 90+%.

    No practical CPU speculatively executes both branches...

    Anyway I'm done replying about this, geez.

  • It's really not that fundamental, if we are talking real-world scenarios... You know that everyone who studied CS forgets about 60% of the useless shit, since after all you will eventually be stuck somewhere doing apps or some shitty reports / BI / AI.

    Let's be realistic.
    Unless of course you are some engineer doing hardware architecture or a new OS.

    @perennate said:

    Hxxx said: You can tell some people replying here are Googling that shit hard to be able to reply with some fancy shit. Hard work boi!

    Branch prediction is a fundamental concept in computer architecture. I think most people who majored in CS in college know what it is.

    But yeah, I googled things like how long it has been around, whether speculative execution is actually a better name (it's not), and some statistics. I tried to include links to stuff like this.

    Edit: the one thing I want to find the source for is the effect of branch prediction on query performance in databases based on the selectivity of the query, but I can't find the paper I remember it from. I guess there's this one, but it's actually studying branch prediction; the cool thing about the one I remember is that they weren't considering branch prediction in their analysis and were actually surprised by the results they got.

  • perennate Member, Host Rep
    edited January 2018

    Hxxx said: Let's be realistic. Unless of course you are some engineer doing hardware architecture or a new OS.

    I meant it's fundamental to computer architecture, not computer science :o

    That's true about forgetting it, I agree. I should have said they would have likely learned it at some point. IMO most people also probably can recall it given context.

  • Shazan Member, Host Rep

    Sorry, I am not a CPU and kernel expert, but if somebody tries to exploit this vulnerability, e.g. by looking for passwords in the whole available memory, wouldn't we see a very high load on the machine?

  • perennate Member, Host Rep
    edited January 2018

    Shazan said: Sorry, I am not a CPU and kernel expert, but if somebody tries to exploit this vulnerability, e.g. by looking for passwords in the whole available memory, wouldn't we see a very high load on the machine?

    This is something I'm not too familiar with, but I'd expect that they would be able to simply decrease the CPU utilization used to exploit the vulnerability at the expense of taking more time to finish the exploit. I don't think the relevant addresses dynamically change but I could be wrong.

    Edit: actually I was thinking about the part of the KVM exploit that first finds the address of certain KVM/kernel code. Once you have basically finished the exploit and are able to read memory, the relevant addresses (i.e., addresses where there is sensitive data like passwords, other VM memory) do probably dynamically change. But odds are that you'll eventually read something sensitive.

    Thanked by Shazan
  • rm_ IPv6 Advocate, Veteran
    edited January 2018

    Darwin said: If you could disable it, you would probably get performance equal to or worse than a similarly clocked old P4.

    I don't think it's possible to toggle it on the fly on any processor.

    But in the ARM world, there are two very similar CPU cores: the Cortex-A9 and the Cortex-A7. The A9 uses out-of-order execution (and all the speculative shit), and is therefore affected by both variants of the exploit. The A7 is unaffected; it's a simpler, strictly in-order core, and since it is smaller, it can be (or is only) shipped as part of dual/quad-core CPUs.

    Here's the kicker: on single-threaded tasks, the A7 is only slower by 15% at most. And since it's usually multi-core, the resulting system actually shows a win over the single-core A9 in real-world usage.

    So it seems like out-of-order execution is not such a big deal, and maybe the next step in computing technology will be similar to what ARM did there: a shift to CPUs with smaller and simpler (but secure!) strictly in-order cores, but a whole lot more of them in a single CPU.

    Thanked by LjL, rds100, default
  • @perennate said:
    Wikipedia and https://people.cs.clemson.edu/~mark/stretch.html both pretty clearly state that the IBM Stretch does speculatively execute instructions

    Which part of "the 7030 is a mainframe while the x86s are processors" do you fail to understand? Let me give you a hint: size and "some MHz" vs "multiple GHz".

    No practical CPU speculatively executes both branches...

    Funny. Because that's what we are talking about. Hint: the problem we discuss here happens while the loser branch is speculatively executed (and the winner branch too, of course). In practical CPUs, namely Intel CPUs, AMD CPUs, and ARM CPUs.

    You see, I certainly understand that one can get excited about that stuff; it's a very interesting, albeit very complex, field after all. I can also understand that one likes to be seen as knowing a lot - and you do know a lot - but I cannot understand why you didn't stop at your limit, beyond which you don't look that good anymore.


    After thinking a little about things and staring with glazed eyes at my Ryzen system, I'd like to switch (at least for the interim) to quite different aspects.

    I think that AMD will gain from this clusterfuck, possibly even massively. For a start, it's "innocent"; after all, the one defining everything x86 is Intel.

    Way more importantly though, Intel is pretty much a monstrous behemoth, and largely a chip factory with some R&D people enhancing some bits here and some bits there. Also importantly, Intel's R&D has for a long time been strongly focused on process technology, in particular on ever smaller gate sizes and how to cost-effectively mass-produce them.

    AMD, on the other hand, is a much smaller player and - that's important - one that couldn't come back to (not insignificant) life again by mere me-too or simply via price (it'd be an afternoon walk for Intel to kill AMD if they tried that). Instead, AMD tried to create a better x86 design within the limits defined by the past decades.
    One result of that is that quite a few bits and pieces of their processors are fresh or significantly enhanced designs; being the underdog, they could and had to do that, i.e. move in the space that's hard to reach for Intel.

    Also, looking at the current clusterfuck, one clearly sees that Intel's design verification failed. Now, for the sake of fairness, the issue we're talking about sits in a very ugly range: on top of normal code verification, and adding very much to complexity, it lives in both the space and the time domain, both of which are very hard verification problems (separation logic, anyone?).

    That, and the fact that Intel is condemned to carry along a very large pile of historic dirt, most of which was probably never properly verified, is what I take to be the root cause of the cluster fuckup. Sure, other factors like market weight, arrogance, and others play into it too, but those are more contextual and not the root causes.

    Plus, AMD has yet another advantage. Not only is it (well, it must be) more agile, but its current processor lines are still in revision cycles anyway, and they have far fewer processor families. This means that most probably they can be much faster in coming up with hardware changes that a) support kernel mitigations and b) strongly drive down the performance loss (think "hardware-supported functionality").

    For Intel that's probably quite a bit more difficult; looking at their structure, my guess is that Intel will take quite a bit longer and/or end up with a less attractive mitigation.

  • keyword of the day: "retpoline"
    https://support.google.com/faqs/answer/7625886

    Interesting summary in this patch:
    https://reviews.llvm.org/D41723
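
    For anyone who wants to experiment once their toolchain carries that patch: as far as I can tell the review adds a -mretpoline switch to clang, so a build would look roughly like this (vuln.c is just a hypothetical file name):

    $ clang -O2 -mretpoline -c vuln.c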
