A few years ago I designed a way to detect bit-flips in Firefox crash reports and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound and a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here's a few numbers to give you an idea of how large the problem is. 🧵 1/5
The problem is that ECC is used to permit price discrimination between server (less price sensitive) and PC (more price sensitive) users. Like, there’s a significant price difference, more than cost-of-manufacture would warrant. There are only a few companies that make motherboard chipsets, like Intel, and they have enough price control over the industry that they can do that.
Also…I’m not sure that ECC is the right fix. I kind of wonder whether the memory is actually broken, or whether people are manually overclocking and running memory at a higher rate than it’s stable at, which would cause the same kind of errors. Or whether BIOSes, which can automatically detect a viable rate by testing memory, are simply being too aggressive in choosing high memory bandwidth rates.
Some of it is cosmic rays, right? I think ECC is still worth it even at JEDEC speeds.
My last Intel motherboard couldn’t handle all four slots filled with 32GB of memory at rated speeds. Any two sticks yes, four no. From reading online, apparently that was a common problem. Motherboard manufacturers (who must have known that there were issues, from their own testing) did not go out of their way to make this clear.
Maybe it’s not an issue with registered/buffered memory, but with plain old unregistered DDR5, I think that manufacturers have really been selling product above what it can realistically do.
ECC memory and server hardware in general is surprisingly cheap if you’re fine buying used gear that’s a few years old. Once that stuff gets old enough that it’s being cycled out of data centers en masse, it hits the used market and the supply often exceeds the limited demand for that kind of stuff.
With that said, I don’t know if that’s true at the moment.
There’s no real good reason that all RAM shouldn’t have been ECC for decades now. It doesn’t actually cost much more to implement. The only reason it isn’t, as tal’s reply mentioned, is artificial price discrimination.
*interest in parity-checking server RAM intensifies*
When I upgrade my home server I would like a low-power system with ECC RAM. I hope it will be financially viable in the future.
I’ve been checking around the used market for DDR4. It seems used ECC DDR4 sticks are now cheaper due to low demand.
In the middle of the rampocalypse, you even wish for an ECC one?