[PTLsim-devel] Memory consistency
Matt T. Yourst
Tue Oct 23 17:51:57 EDT 2007
On Tuesday 23 October 2007 08:30, Stephan Diestelhorst wrote:
> > Real x86 hardware supports AFAIK Total-Store-Ordering as the basic
> > memory model. (Let's skip nc, wc and odd things) I was wondering,
> > how that was enforced with SMT. Intuitively, this would not be
> > necessary, as the SMT should just multiplex front-end and back-end
> > structures and essentially maintain a shared core.
> >
> > Please note that we don't need any fence to ensure ordering here.
> > The underlying consistency model should enforce the mentioned
> > guarantees.
>
> Actually it turns out that this feature has just recently been
> announced by AMD (in the current manuals) and Intel still allows load
> reordering. Hence it is safe to assume that code relying on the order
> of loads will use lfences where necessary.
>
> So I need to look at other places. Kind of clinging to straws
> while my split core model dies.
>
I think the current PTLsim SMT core model may not properly enforce all the
load/store ordering rules between threads the same way real x86 chips do.
The SMT core only enforces these ordering constraints:
- For atomic x86 instructions, the ld.acq uop globally locks the cache line
via MemoryInterlockBuffer before loading the value; the lock only gets
released after the entire atomic instruction commits. The microcode for
atomic instructions creates a memory fence (mf uop) before and after the
initial load and final store to move unrelated instructions out of the way.
- Any other loads or stores (even from regular non-atomic instructions!) in
any other threads or cores will be replayed if they try to access a locked
cache line.
- Total store ordering is enforced because stores are visible to all other
threads and cores the instant they pass through the commit stage. This is a
result of the simulation method we use, which models the cache state machine
separately from the actual data words, which are always stored to and loaded
from physical memory instantly at commit (for stores) or issue (for loads),
regardless of the state of the cache lines in the model.
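
To make the three rules above concrete, here is a minimal toy sketch in Python. It is not PTLsim code: the class, its method names, the REPLAY sentinel and the 64-byte line size are all assumptions made for illustration, and issue and commit are collapsed into a single step. It only shows the intended behavior: ld.acq locks the line, accesses from other threads to that line are replayed, the lock is released when the atomic instruction commits, and committed stores are visible to every thread immediately.

```python
REPLAY = object()  # hypothetical sentinel: the access must be replayed later


class ToyInterlockModel:
    """Toy model of the rules above. Every name here is hypothetical."""

    LINE_SIZE = 64  # assumed cache line size in bytes

    def __init__(self):
        self.memory = {}  # address -> value, written the instant a store commits
        self.locks = {}   # cache line number -> thread holding the interlock

    def _locked_by_other(self, tid, addr):
        owner = self.locks.get(addr // self.LINE_SIZE)
        return owner is not None and owner != tid

    def ld_acq(self, tid, addr):
        """First load of an atomic instruction: lock the line, then load."""
        if self._locked_by_other(tid, addr):
            return REPLAY
        self.locks[addr // self.LINE_SIZE] = tid
        return self.memory.get(addr, 0)

    def load(self, tid, addr):
        """Ordinary load: replayed if another thread holds the line."""
        if self._locked_by_other(tid, addr):
            return REPLAY
        return self.memory.get(addr, 0)

    def store(self, tid, addr, value):
        """Ordinary store: replayed if the line is locked, else visible at once."""
        if self._locked_by_other(tid, addr):
            return REPLAY
        self.memory[addr] = value
        return None

    def commit_atomic(self, tid, addr, value):
        """Final store of the atomic instruction commits, then the lock is released."""
        line = addr // self.LINE_SIZE
        if self.locks.get(line) != tid:
            raise RuntimeError("commit without holding the interlock")
        self.memory[addr] = value
        del self.locks[line]


model = ToyInterlockModel()
model.store(0, 0x100, 5)                # thread 0: plain store, visible immediately
old = model.ld_acq(0, 0x100)            # thread 0: atomic increment begins, line locked
blocked = model.load(1, 0x108)          # thread 1: same 64-byte line, must replay
model.commit_atomic(0, 0x100, old + 1)  # thread 0: atomic commits, lock released
after = model.load(1, 0x100)            # thread 1: the retry now sees the new value
print(blocked is REPLAY, after)         # prints: True 6
```

Note that the model has no notion of cache line state at all, which mirrors the point in the last rule: data words go straight to memory regardless of what the cache state machine is doing.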
These rules happen to work correctly because most real code (including Linux
and all applications using the standard threading libraries) uses atomic
instructions and LFENCE/SFENCE just to be safe.
The only potentially problematic case is the Linux seqlock system, which
depends on total store ordering. It uses LFENCE to ensure any other loads in
the pipeline have issued before the LFENCE proceeds. However, PTLsim currently
applies this only within a single thread context, which is incorrect (and
should be fixed when we get a chance).
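
For readers unfamiliar with seqlocks, the following is a rough single-threaded sketch of the protocol in Python. It is not the Linux implementation: the class and method names are made up for illustration, the interleaving is driven by hand, and the fences appear only as comments because Python has no equivalent. The point is to show where the reader depends on its loads staying in order: the data loads must fall between the two loads of the sequence counter.

```python
class ToySeqLock:
    """Hand-stepped toy seqlock. All names are hypothetical."""

    def __init__(self, data):
        self.seq = 0            # even: stable, odd: a write is in progress
        self.data = list(data)

    def write_begin(self):
        self.seq += 1           # counter becomes odd
        # Real code needs a store barrier here so that the counter update
        # becomes visible before any of the data stores.

    def write_end(self):
        # Real code needs a store barrier here so that the data stores
        # become visible before the final counter update.
        self.seq += 1           # counter becomes even again

    def read_begin(self):
        start = self.seq
        # Real code places a load fence (LFENCE, per the text) here so the
        # data loads cannot be performed before this counter load.
        return start

    def read_retry(self, start):
        # Real code places a load fence here so the data loads cannot be
        # performed after the counter re-check below.
        return (start & 1) != 0 or self.seq != start


lock = ToySeqLock([0, 0])

# Undisturbed read: no retry needed.
start = lock.read_begin()
snapshot = list(lock.data)
clean = not lock.read_retry(start)

# A complete write slips in between the reader's two data loads.
start = lock.read_begin()
first = lock.data[0]
lock.write_begin()
lock.data[0] = 1
lock.data[1] = 1
lock.write_end()
second = lock.data[1]
torn = lock.read_retry(start)   # the torn (0, 1) snapshot is rejected

print(clean, torn, (first, second))  # prints: True True (0, 1)
```

If the simulator lets a load from one thread slip past the fence relative to another thread's stores, the final counter check can pass even though the snapshot is torn, which is why a fence that only orders loads within a single thread context is a concern here.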
Here's a discussion on RWT where Linus Torvalds reveals some interesting
behavior of LFENCE (specifically, on some Intel processors, it appears to be
internally translated into a NOP):
http://www.realworldtech.com/forums/index.cfm?action=detail&id=73984&threadid=73581&roomid=2
I'll try to take a look at this when I get a chance. It's interesting that
some of Intel's own chips violate the official ordering rules here:
https://ptlsim.org/papers/Intel-x86-Manuals/MemoryOrdering-318147.pdf
- Matt
-------------------------------------------------------
Matt T. Yourst yourst at peptidal.com
Peptidal Research Inc., Co-Founder and Lead Architect
-------------------------------------------------------