[PTLsim-devel] Issues with multiple OOO-cores
Matt T. Yourst
Mon Sep 3 22:31:18 EDT 2007
On Monday 03 September 2007 11:26, Stephan Diestelhorst wrote:
> Hi Matt and list,
> thanks again for the detailed outline.
> I have started hacking, but have stumbled (quite as expected) upon some
> issues, see below. Additionally, I propose a lighter "poor man's" MESI
> implementation, see bottom of this mail!
>
One of my colleagues, Avadh Patel (apatel at cs.binghamton.edu), has done quite a
lot of work on the MESI model. I just found out about this last week, so I'll
see if I can post some of his code. Right now it isn't integrated with PTLsim
since he's still testing it, but it looks like a really good design. I've
attached the design spec (PDF) he wrote up.
> >
> > // Memory bus transaction types
> > enum {
> >   TRANSACTION_MESI_S_TO_M,        // Shared to modified (broadcast to invalidate)
> >   TRANSACTION_MESI_E_TO_S,        // Exclusive to shared (broadcast to update)
> >   TRANSACTION_MESI_S_TO_E,        // Shared to exclusive (broadcast to update)
> >   TRANSACTION_DRAM_FETCH,         // fetch from DRAM: access memory modules
> >   ...
> >   // TRANSACTION_MESI_E_TO_M not needed - done locally
> >   // TRANSACTION_MESI_M_TO_E not needed - done locally
> >   RESPONSE_MESI_INVALIDATED_LINE, // Tells all other nodes the requested line
> >                                   // was invalidated (i.e. by MESI_S_TO_M)
> >   ...
> > };
>
> It is not clear to me why we would need so many transaction types. In
> normal M(O)ESI terminology, we either have (shared) probes or invalidating
> ones, and the associated replies.
>
Most of these transactions can be done locally in the L1 without any coherence
actions. For simulation purposes (so we can log them and collect stats),
they're all defined here, even if they do not cause actual bus traffic.
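To make that concrete, here is a minimal sketch of the idea - hypothetical
names, not Avadh's actual code: every transition is counted for the stats
tree, but only the ones that really need the bus report traffic.

// Hypothetical sketch: count every coherence transition for stats/logging,
// but only report bus traffic for transitions that actually need it.
#include <cstdint>

enum TransactionType {
  TRANSACTION_MESI_S_TO_M,   // broadcast to invalidate
  TRANSACTION_MESI_E_TO_S,   // broadcast to update
  TRANSACTION_MESI_S_TO_E,   // broadcast to update
  TRANSACTION_MESI_E_TO_M,   // local-only, but still counted
  TRANSACTION_MESI_M_TO_E,   // local-only, but still counted
  TRANSACTION_DRAM_FETCH,    // access memory modules
  TRANSACTION_COUNT
};

struct CoherenceStats {
  uint64_t counts[TRANSACTION_COUNT] = {};

  // Record a transition; returns true if it must go out on the bus.
  bool record(TransactionType t) {
    counts[t]++;
    return (t != TRANSACTION_MESI_E_TO_M) && (t != TRANSACTION_MESI_M_TO_E);
  }
};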
> > Example: If a line is in the SHARED state and a store to that line is
> > attempted (i.e. commitstore() is called), it has to be moved to the
> > MODIFIED state. You'll need to add code to the SMT core to cause stores
> > to wait in the ROB at commit time until the line is actually moved to the
> > new MODIFIED state.
>
> That would mean stalling the commit engine of the entire CPU for the
> duration of the entire bus transaction (including waiting for answers). I'm
> not sure whether the ROB is big enough to mask this delay! In real CPUs,
> wouldn't you offload this asynchronously to a write buffer, freeing the
> commit entry?
>
I think I explained that poorly.
In most x86 chips I'm aware of, the processor is not allowed to commit a store
into an L1 line if that line is currently in the SHARED state (it has to be
moved to the MODIFIED state first). The SHARED->MODIFIED transition is
initiated in the background as soon as the store's address is known (so the
store and all instructions surrounding it can execute without stalls), but it
needs to complete by the time the store reaches the head of the ROB.
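As a sketch of that ordering (hypothetical names - the real SMT core hooks
will differ): the upgrade starts as soon as the store's address resolves, and
commit just re-checks the line's state.

enum MESIState { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

struct L1Line {
  MESIState state;
  bool upgrade_pending;   // SHARED->MODIFIED transaction in flight
};

// Called when the store's address first becomes known: start the
// SHARED->MODIFIED upgrade in the background so it overlaps with the
// store's remaining time in the ROB.
void on_store_address_resolved(L1Line& line) {
  if (line.state == MESI_SHARED && !line.upgrade_pending) {
    line.upgrade_pending = true;
    // bus.broadcast_invalidate(...);  // completes asynchronously
  }
}

// Called when the store reaches the head of the ROB; false = stall commit.
// EXCLUSIVE is fine too, since EXCLUSIVE->MODIFIED is a silent local upgrade.
bool can_commit_store(const L1Line& line) {
  return (line.state == MESI_MODIFIED) || (line.state == MESI_EXCLUSIVE);
}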
There are no (currently disclosed) x86 chips that implement transactional
memory, in which the store leaves the ROB and writes into the L1 even before
it's certain no other cores have copies of that line. In a TM-based model,
the coherence checks can be deferred until the line leaves the L1 (or write
buffer) at the end of a transaction, which is what you're suggesting if I
understand correctly.
Fortunately, with a big ROB, the commit can be 50-100 cycles in the future
after the store first issues. On average, coherence transactions can complete
within this time in most machines - although it certainly is a scalability
problem if you have lots of shared data. Many studies of this TM-based
approach indicate it's indeed faster than a blocking ROB when lots of shared
data is present (speculative lock elision is one reason - look it up).
The situation is more complicated for true load-execute-store atomic
instructions. For these, the ld.acq uop is like a store: it must put the line
into the MODIFIED state BEFORE the initial value is loaded - otherwise we
could have a race condition with another processor (there are optimizations
to avoid this in some cases, but in other cases ld.acq must be treated like a
memory barrier - the microcode for e.g. CMPXCHG does this).
This means ld.acq could block the ROB and have to wait 100+ cycles if the line
is in the SHARED state. In practice, this is very rare, since most lines that
contain spinlocks/semaphores/etc. are in the EXCLUSIVE state in the L1 cache
of the core that previously acquired the lock. In this case, the
EXCLUSIVE->MODIFIED transition happens at the same time the ld.acq issues, so
no delays are needed. Most threading libraries and kernels try to put each
lock on a separate 64-byte line without unrelated data near it.
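In sketch form (hypothetical names again, reusing the L1Line type from the
sketch above), the ld.acq issue path looks like this:

// Hypothetical sketch: ld.acq must own the line before the value is read.
bool issue_ld_acq(L1Line& line) {
  switch (line.state) {
  case MESI_MODIFIED:
    return true;                  // already owned, no delay
  case MESI_EXCLUSIVE:
    line.state = MESI_MODIFIED;   // silent local upgrade: the common
    return true;                  // lock-reacquire case, no delay
  case MESI_SHARED:
  case MESI_INVALID:
  default:
    // Must broadcast an invalidate (and possibly fetch the line) first;
    // the uop replays and may hold up the ROB head until ownership arrives.
    // bus.broadcast_invalidate(...);
    return false;
  }
}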
>
> Where would you place the corresponding logic? I've started implementing
> this inside the CacheHierarchy, with the core just probing the hierarchy to
> ask whether the store can commit and, if not, instructing the
> CacheHierarchy to initiate the misses etc.
>
> A more general question: does communication run between Core <-> Core
> (and then descend to their respective caches), or rather between
> CacheHierarchy <-> CacheHierarchy (with those probably contacting the
> associated cores)? This is more of a design issue, but any pointers would
> be nice!
>
> Apart from this, I wonder whether this effort is really necessary; I'm
> envisioning some kind of poor man's MESI:
>
> MESI's main purpose is to ensure correctness in a physically disconnected
> system. Inside the simulator, we can ensure correctness in a much simpler
> way, as we have instant visibility and a single place for the data (the
> main memory of the simulator).
>
> Given that, simulating MESI is interesting only from a simulation accuracy
> point of view. And if we assume that in real CPU designs everything is
> done to mask the latencies occurring from MESI bus transactions as far as
> possible (most importantly *not* stalling the commit engine, but rather
> offloading that to a big-enough write buffer), there are IMHO just a few
> points where MESI affects performance:
>
> 1) additional cache evictions on a core, due to modifications from other
> cores
>
> 2) different latency for loads that miss our cache and hit in other caches,
> as we might get the data from the other cache, i.e. we wouldn't pay the full
> MAIN_MEM_LATENCY, but rather some CROSS_L2_LATENCY (this would be
> significant only if CROSS_L2_LATENCY << MAIN_MEM_LATENCY).
> This holds, of course, only if MAIN_MEM_LATENCY + L2_LATENCY >
> MESI_BUS_LATENCY.
>
> Effect 1) can be emulated by simply removing a cache line from all other
> CacheHierarchies for a store on some core. This would omit the delay it
> takes for the request to propagate, but how much precision does that kill?
>
> For 2) we might have to increase MAIN_MEM_LATENCY in case bus transactions
> take longer than going to main memory. And if we assume that caches forward
> shared data (as we might have some fast HyperTransport link between them,
> and don't have to go over the FSB ;-) ), we can simply check in
> issueload_slowpath whether the other CacheHierarchies have it (in the
> simulator, with zero delay) and then use CROSS_L2_LATENCY rather than
> MAIN_MEM_LATENCY.
>
> Any comments on this? Has anybody tried something similar and gained some
> insight? Simulation results? It seems to me the changes are pretty simple,
> so I will try to go that route first (due to time constraints on my
> side :-/ )
>
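For concreteness, the scheme you're describing amounts to roughly the
following sketch (hypothetical names throughout - issueload_slowpath is the
only real PTLsim entry point mentioned, and it isn't actually called here):

// Hypothetical sketch of the "poor man's MESI" described above.
#include <cstdint>
#include <vector>

struct CacheHierarchy {
  bool probe(uint64_t physaddr);       // is the line present here?
  void invalidate(uint64_t physaddr);  // drop the line if present
};

const int CROSS_L2_LATENCY = 40;    // assumed cycle counts - tune as needed
const int MAIN_MEM_LATENCY = 200;

// Effect 1: on a committed store, instantly invalidate the line in every
// other hierarchy. This skips the real propagation delay of an invalidate.
void commit_store_invalidate(int coreid, uint64_t physaddr,
                             std::vector<CacheHierarchy*>& caches) {
  for (size_t i = 0; i < caches.size(); i++)
    if ((int)i != coreid) caches[i]->invalidate(physaddr);
}

// Effect 2: on a load miss (e.g. from issueload_slowpath), probe the other
// hierarchies with zero delay and charge the cheaper cross-cache latency.
int load_miss_latency(int coreid, uint64_t physaddr,
                      std::vector<CacheHierarchy*>& caches) {
  for (size_t i = 0; i < caches.size(); i++)
    if ((int)i != coreid && caches[i]->probe(physaddr))
      return CROSS_L2_LATENCY;
  return MAIN_MEM_LATENCY;
}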
The accuracy question is really a philosophical issue - I know many people are
using PTLsim to measure things like bus contention and to do power modeling,
so just measuring delays will not help them. On the other hand, this
simplified approach is likely to be very difficult to debug and to collect
statistics from - it might appear to work, but you'll never be able to
inspect the actual miss queues and bus traffic to make sure it's accurate.
Let me take a look at what Avadh has done so far - I think he already
implemented the various queues and a blocking bus design. I'll post his code
as soon as he cleans it up a little bit, even though it's not complete yet.
- Matt
-------------------------------------------------------
Matt T. Yourst yourst at peptidal.com
Peptidal Research Inc., Co-Founder and Lead Architect
-------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PTLsim-MESI-Design-Proposal-from-Avadh.pdf
Type: application/pdf
Size: 93865 bytes
Desc: not available
URL: https://ptlsim.org/pipermail/ptlsim-devel/attachments/20070903/6af9cac0/attachment-0001.pdf