[PTLsim-devel] Issues with multiple OOO-cores

Stephan Diestelhorst
Mon Sep 3 11:26:28 EDT 2007


Hi Matt and list,
  thanks again for the detailed outline.
I have started hacking and have, quite as expected, stumbled upon some
issues; see below. Additionally, I propose a lighter "poor man's" MESI
implementation; see the bottom of this mail!

<snipped random bits below>
> Here's what I wrote up for my research group back in May 2007. As far as I
> know, they haven't made much progress on this, but I'll ask around just to
> be sure. You're welcome to get started, and I'd be glad to provide help if
> you get stuck.
>
> To implement MESI state transitions like this, you'll need to set up a
> special MemoryBus structure, very similar to the MissBuffer class (you can
> copy this code into the MemoryBus class). MemoryBus is declared globally or
> inside SMTMachine, rather than per-core like MissBuffer.
>
> // Memory bus transaction types
> enum {
>   TRANSACTION_MESI_S_TO_M,       // Shared to modified (broadcast to invalidate)
>   TRANSACTION_MESI_E_TO_S,       // Exclusive to shared (broadcast to update)
>   TRANSACTION_MESI_S_TO_E,       // Shared to exclusive (broadcast to update)
>   TRANSACTION_DRAM_FETCH,        // fetch from DRAM: access memory modules
>   ...
>   // TRANSACTION_MESI_E_TO_M not needed - done locally
>   // TRANSACTION_MESI_M_TO_E not needed - done locally
>   RESPONSE_MESI_INVALIDATED_LINE, // Tells all other nodes the requested line
>                                   // was invalidated (i.e. by MESI_S_TO_M)
>   ...
> };

It is not clear to me why we would need so many transaction types. In normal 
M(O)ESI terminology, we have either (shared) probes or invalidating ones, 
plus the associated replies.
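For illustration, the simpler scheme I have in mind could look roughly like 
this (the names are hypothetical, not existing PTLsim identifiers):

```cpp
#include <cassert>

// A probe is either shared (read intent) or invalidating (write intent);
// replies just report what the probed caches held.
enum ProbeType  { PROBE_SHARED, PROBE_INVALIDATE };
enum ProbeReply { REPLY_MISS, REPLY_HIT_CLEAN, REPLY_HIT_DIRTY };

// A store that finds the line SHARED needs an invalidating probe;
// a plain load miss needs only a shared probe.
inline ProbeType probe_for_access(bool is_store) {
  return is_store ? PROBE_INVALIDATE : PROBE_SHARED;
}
```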

> Example: If a line is in the SHARED state and a store to that line is
> attempted (i.e. commitstore() is called), it has to be moved to the
> MODIFIED state. You'll need to add code to the SMT core to cause stores to
> wait in the ROB at commit time until the line is actually moved to the new
> MODIFIED state.

That would mean stalling the commit engine of the entire CPU for the duration 
of the whole bus transaction (including waiting for the answers). I'm not 
sure whether the ROB is big enough to mask this delay! In a real CPU, 
wouldn't you offload the store asynchronously to a write buffer and free the 
commit entry immediately?
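What I mean by the write-buffer alternative, as a rough sketch (PendingStore 
and WriteBuffer are illustrative names, not PTLsim structures):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <deque>

// A committed store parked here while its S->M bus transaction is
// still in flight; the ROB entry is already freed.
struct PendingStore {
  uint64_t addr;
  uint64_t data;
};

struct WriteBuffer {
  size_t capacity;
  std::deque<PendingStore> entries;

  // Commit succeeds (and frees the ROB entry) only if the buffer has
  // room; otherwise the store is replayed, so commit stalls only when
  // the buffer is full, not on every transition.
  bool try_commit(const PendingStore& st) {
    if (entries.size() >= capacity) return false;
    entries.push_back(st);
    return true;
  }

  // Called when the MESI transition for the oldest entry completes.
  void drain_one() {
    if (!entries.empty()) entries.pop_front();
  }
};
```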

> For the example above (i.e. move line from SHARED to MODIFIED state), the
> following actions should be done:
>
> - Add an entry for the line to the local core's MissBuffer, and put that
> miss buffer in a new state called STATE_MESI_TRANSITION (add this to the
> STATE_xxx enum). This means the line is undergoing a transition from one
> state to another. The line itself (in both the L1D and L2, since L2 is
> inclusive) is also put in the MESI_TRANSITION state.
>
> - Add an entry for the line to the MemoryBus object, and put that entry
> into the WAIT_FOR_BUS state.
>
> Obviously these actions need to be done atomically - if there isn't space
> in both MissBuffer and MemoryBus, the load or store causing the transition
> must be replayed until there is space.
> ....

Where would you place the corresponding logic? I've started implementing this 
inside the CacheHierarchy, with the core just asking the hierarchy whether 
the store can commit and, if not, instructing the CacheHierarchy to initiate 
the misses etc.
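The atomic allocation quoted above could be sketched like this (the counters 
are stand-ins for free MissBuffer/MemoryBus slots, not real PTLsim fields):

```cpp
#include <cassert>

// Reserve an entry in BOTH the MissBuffer and the MemoryBus, or in
// neither: if either is full, nothing is allocated and the load/store
// causing the transition must be replayed.
struct TransitionAllocator {
  int free_missbuf;
  int free_bus;

  bool try_start_transition() {
    if (free_missbuf <= 0 || free_bus <= 0) return false;
    free_missbuf--;
    free_bus--;
    return true;
  }
};
```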

A more general question: should communication happen Core <-> Core (then 
descending into their respective caches), or rather CacheHierarchy <-> 
CacheHierarchy (with those probably contacting their associated cores)? This 
is more of a design issue, but any pointers would be nice!

Apart from this, I wonder whether this effort is really necessary; I'm 
envisioning some kind of poor man's MESI:

MESI's main purpose is to ensure correctness in a physically distributed 
system. Inside the simulator, we can ensure correctness in a much simpler 
way, as we have instant visibility and a single place for the data (the main 
memory of the simulator).

Given that, simulating MESI is only interesting from a simulation accuracy 
point of view. And if we assume that real CPU designs do everything to mask 
the latencies occurring from MESI bus transactions as far as possible (most 
importantly *not* stalling the commit engine, but rather offloading the store 
to a big-enough write buffer), there are IMHO just a few points where MESI 
affects performance:

1) additional cache evictions on a core, due to modifications from other cores

2) a different latency for loads that miss our cache but hit in another 
cache: we might get the data from the other cache, i.e. we wouldn't pay the 
full MAIN_MEM_LATENCY, but rather some CROSS_L2_LATENCY (significant only if 
CROSS_L2_LATENCY << MAIN_MEM_LATENCY). Of course this only holds if 
MAIN_MEM_LATENCY + L2_LATENCY > MESI_BUS_LATENCY.

Effect 1) can be emulated by simply removing the cache line from all other 
CacheHierarchies whenever some core stores to it. This omits the delay it 
takes for the invalidation to propagate, but how much precision does that 
kill?
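A zero-delay sketch of effect 1) could look like this (CacheHierarchy here is 
a minimal stand-in holding a set of line addresses, not the real PTLsim 
class):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <set>
#include <vector>

// Stand-in for a per-core cache hierarchy: just the set of cached lines.
struct CacheHierarchy {
  std::set<uint64_t> lines;
  bool contains(uint64_t line) const { return lines.count(line) != 0; }
  void invalidate(uint64_t line) { lines.erase(line); }
};

// On a committed store, evict `line` from every hierarchy except the
// storing core's own, with zero simulated propagation delay.
void store_invalidate_others(std::vector<CacheHierarchy>& hiers,
                             size_t storing_core, uint64_t line) {
  for (size_t i = 0; i < hiers.size(); i++) {
    if (i != storing_core) hiers[i].invalidate(line);
  }
}
```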

For 2) we might have to increase MAIN_MEM_LATENCY in case bus transactions 
take longer than going to main memory. And if we assume that caches forward 
shared data (as we might have some fast HyperTransport link between them, and 
don't have to go on the FSB ;-) ), we can simply check in issueload_slowpath 
whether the other CacheHierarchies have the line (in the simulator, with zero 
delay) and then use CROSS_L2_LATENCY rather than MAIN_MEM_LATENCY.
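The latency selection for effect 2) is then a one-line check (the latency 
constants are illustrative values, not PTLsim's actual ones):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <set>
#include <vector>

// Illustrative latencies in cycles; real values would come from the config.
const int MAIN_MEM_LATENCY = 140;
const int CROSS_L2_LATENCY = 35;

// Stand-in for a per-core cache hierarchy: just the set of cached lines.
struct CacheHierarchy {
  std::set<uint64_t> lines;
  bool contains(uint64_t line) const { return lines.count(line) != 0; }
};

// On an L2 miss, probe the peer hierarchies with zero simulated delay
// and charge the cheaper cache-to-cache latency when a peer holds the line.
int miss_latency(const std::vector<CacheHierarchy>& hiers,
                 size_t my_core, uint64_t line) {
  for (size_t i = 0; i < hiers.size(); i++) {
    if (i != my_core && hiers[i].contains(line)) return CROSS_L2_LATENCY;
  }
  return MAIN_MEM_LATENCY;
}
```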

Any comments on this? Has anybody tried something similar and gained some 
insight? Simulation results? It seems to me that the changes are pretty 
simple, so I will try that route first (due to time constraints on my 
side :-/ )

Glad to hear any suggestions!

Thanks,
  Stephan

-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst at amd.com, Tel.   (AMD: 8-4903)



