[PTLsim-devel] Memory consistency

Tue Oct 23 17:51:57 EDT 2007

On Tuesday 23 October 2007 08:30, Stephan Diestelhorst wrote:
> > Real x86 hardware supports AFAIK Total-Store-Ordering as the basic
> > memory model. (Let's skip nc, wc and odd things) I was wondering,
> > how that was enforced with SMT. Intuitively, this would not be
> > necessary, as the SMT should just multiplex front-end and back-end
> > structures and essentially maintains a shared core.
> >
> > Please note that we don't need any fence to ensure ordering here.
> > The underlying consistency model should enforce the mentioned
> > guarantees.
>
> Actually it turns out that this feature has just recently been
> announced by AMD (in the current manuals) and Intel still allows load
> reordering. Hence it is safe to assume that code relying on the order
> of loads will use lfences where necessary.
>
> So I need to look at other places... Kind of clinging to straws
> while my split core model dies.
>

I think the current PTLsim SMT core model may not enforce all of the
load/store ordering rules between threads the same way real x86 chips do.

The SMT core only enforces these ordering constraints:

- For atomic x86 instructions, the ld.acq uop globally locks the cache line
via MemoryInterlockBuffer before loading the value; the lock is only
released after the entire atomic instruction commits. The microcode for
atomic instructions also places memory fences (mf uops) around the initial
load and the final store to move unrelated instructions out of the way.

- Any other loads or stores (even from regular non-atomic instructions!) in 
any other threads or cores will be replayed if they try to access a locked 
cache line.

- Total store ordering is enforced because stores are visible to all other 
threads and cores the instant they pass through the commit stage. This is a 
result of the simulation method we use, which models the cache state machine 
separately from the actual data words, which are always stored to and loaded 
from physical memory instantly at commit (for stores) or issue (for loads), 
regardless of the state of the cache lines in the model.
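
To make the first two rules concrete, here is a minimal C++ sketch of a
per-line lock table with replay. It is a toy model and not PTLsim's actual
MemoryInterlockBuffer: the class name ToyInterlockBuffer, its methods, the
Access enum and the 64-byte line size are all assumptions made up for
illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical toy model; NOT the real PTLsim MemoryInterlockBuffer.
enum class Access { Proceed, Replay };

class ToyInterlockBuffer {
public:
    static constexpr uint64_t kLineBytes = 64;  // assumed cache line size

    // ld.acq analogue: try to lock the line containing addr for `thread`.
    Access acquire(uint64_t addr, int thread) {
        uint64_t line = addr / kLineBytes;
        auto it = owner_.find(line);
        if (it != owner_.end() && it->second != thread) return Access::Replay;
        owner_[line] = thread;
        return Access::Proceed;
    }

    // Any ordinary load or store (atomic or not) checks the lock first and
    // must be replayed if another thread holds the line.
    Access access(uint64_t addr, int thread) const {
        auto it = owner_.find(addr / kLineBytes);
        if (it != owner_.end() && it->second != thread) return Access::Replay;
        return Access::Proceed;
    }

    // Called once the entire atomic instruction has committed.
    void release(uint64_t addr, int thread) {
        auto it = owner_.find(addr / kLineBytes);
        assert(it != owner_.end() && it->second == thread);
        owner_.erase(it);
    }

private:
    std::unordered_map<uint64_t, int> owner_;  // line number -> owning thread
};
```

In this sketch a thread that gets Access::Replay would simply retry the uop
later; the lock is keyed by cache line, so accesses to neighbouring lines
are unaffected.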

These rules happen to work in practice because most real code (including
Linux and all applications using the standard threading libraries) uses
atomic instructions and LFENCE/SFENCE for synchronization, just to be safe.

The only possibly problematic case is the Linux seqlock system, which
depends on total store ordering. It uses LFENCE to ensure any other loads in
the pipeline have issued before the LFENCE proceeds. However, PTLsim
currently enforces this only within a single thread context, which is
incorrect (and should be fixed when we get a chance).
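
For readers who haven't looked at seqlocks, here is a rough C++ sketch of
the pattern, written with std::atomic rather than the kernel's own
primitives. The SeqlockPair type and its method names are hypothetical, a
single writer is assumed, and the acquire fence in the reader only stands in
for the read barrier (the LFENCE mentioned above); this is not the Linux
implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical seqlock-protected pair of values; single writer assumed.
struct SeqlockPair {
    std::atomic<uint32_t> seq{0};  // odd while a write is in progress
    std::atomic<uint64_t> a{0};
    std::atomic<uint64_t> b{0};

    void write(uint64_t na, uint64_t nb) {
        uint32_t s = seq.load(std::memory_order_relaxed);
        assert((s & 1) == 0);                                 // no concurrent writer
        seq.store(s + 1, std::memory_order_relaxed);          // mark write in progress
        std::atomic_thread_fence(std::memory_order_release);  // data stores stay after it
        a.store(na, std::memory_order_relaxed);
        b.store(nb, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);          // publish the new values
    }

    // Returns false if a writer was active or intervened; caller retries.
    bool try_read(uint64_t& oa, uint64_t& ob) const {
        uint32_t s1 = seq.load(std::memory_order_acquire);
        if (s1 & 1) return false;
        oa = a.load(std::memory_order_relaxed);
        ob = b.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);  // read barrier stand-in
        uint32_t s2 = seq.load(std::memory_order_relaxed);
        return s1 == s2;
    }
};
```

The reader is only correct if its data loads are ordered between the two
loads of seq as seen by the writer's thread, which is why a barrier that
only orders loads within one thread context is a concern here.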

Here's a discussion on RWT where Linus Torvalds reveals some interesting
behavior of LFENCE (specifically, on some Intel processors, it appears to be
internally translated into a NOP):

http://www.realworldtech.com/forums/index.cfm?action=detail&id=73984&threadid=73581&roomid=2

I'll try to take a look at this when I get a chance. It's interesting that 
some of Intel's own chips violate the official ordering rules here:

https://ptlsim.org/papers/Intel-x86-Manuals/MemoryOrdering-318147.pdf

- Matt

-------------------------------------------------------
 Matt T. Yourst                    yourst at peptidal.com
 Peptidal Research Inc., Co-Founder and Lead Architect
-------------------------------------------------------