[PTLsim-devel] Issues with multiple OOO-cores

Stephan Diestelhorst
Tue Sep 4 18:51:33 EDT 2007


Hi Matt,
  thanks for the quick reply!

> One of my colleagues, Avadh Patel (apatel at cs.binghamton.edu), has done quite a
> lot of work on the MESI model. I just found out about this last week, so I'll
> see if I can post some of his code. Right now it isn't integrated with PTLsim
> since he's still testing it, but it looks like a really good design. I've
> attached the design spec (PDF) he wrote up.

I read through the design document; it seems to be a fairly clean
design, but it comes with a fair number of added components.

> > That would mean to stall the commit engine of the entire CPU for the
> > duration of the entire bus-transaction (including waiting for answers). I'm
> > not sure, whether the ROB is big enough to mask this delay! In real CPUs
> > wouldn't you offload this asynchronously to a write buffer and freeing the
> > commit entry?
> >
>
> I think I explained that poorly.
>
> In most x86 chips I'm aware of, the processor is not allowed to commit a store
> into an L1 line if that line is currently in the SHARED state (it has to be
> moved to the MODIFIED state first). The SHARED->MODIFIED transition is
> initiated in the background as soon as the store's address is known (so the
> store and all instructions surrounding it can execute without stalls), but it
> needs to complete by the time the store reaches the head of the ROB.

Alright, that is actually the way I'd do it with just the ROB. The
problem with this approach is that mis-speculated stores can cause
unnecessary probe traffic and, worse, spurious evictions in other
caches.
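
To make sure we mean the same thing, here is roughly how I picture
that commit-time rule, as a small self-contained C++ sketch (all
names are mine, not PTLsim's, and the upgrade request itself is just
a flag here):

  #include <cassert>

  enum class LineState { Invalid, Shared, Exclusive, Modified };

  struct CacheLine {
      LineState state = LineState::Invalid;
      bool upgrade_pending = false;   // SHARED->MODIFIED request in flight
  };

  // Called as soon as the store's address is known (at issue time):
  void on_store_address_resolved(CacheLine& line) {
      if (line.state == LineState::Shared && !line.upgrade_pending)
          line.upgrade_pending = true;  // mark that an upgrade probe went out
  }

  // Called when the store reaches the head of the ROB:
  bool can_commit_store(const CacheLine& line) {
      // Commit may only proceed once we own the line; otherwise the
      // core stalls until the upgrade acknowledgement arrives.
      return line.state == LineState::Modified ||
             line.state == LineState::Exclusive;
  }

  int main() {
      CacheLine line{LineState::Shared, false};
      on_store_address_resolved(line);
      assert(!can_commit_store(line));   // still waiting for the upgrade ack
      line.state = LineState::Modified;  // ack arrives from the bus
      line.upgrade_pending = false;
      assert(can_commit_store(line));
  }

The point is that a mis-speculated store still executes the first
function and thus still generates the probe, even though it never
reaches the second.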

> There are no (currently disclosed) x86 chips that implement transactional
> memory, in which the store leaves the ROB and writes into the L1 even before
> it's certain no other cores have copies of that line.
> In a TM-based model,
> the coherence checks can be deferred until the line leaves the L1 (or write
> buffer) at the end of a transaction, which is what you're suggesting if I
> understand correctly.

Actually I was thinking about some additional buffer between the two
(core and L1), but the L1 could actually serve the same purpose,
meaning that it might not at all times be fully coherent system-wide.
I'm not sure how TM would fit into this. I think that the store, as
soon as it reaches commit, will succeed anyway, no matter how long it
takes to announce that action.
This can create odd effects, where two cores do (update B, read A) and
(update A, read B) respectively, and both cores read the old value,
but that should be treatable with fences, right? That would mean that
fences have to check 'my' additional write buffer, too. I can't see
anything speculative anymore once the store has been committed, so how
would TM fit in here?
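
The interleaving I mean, written as a C++ litmus test (nothing to do
with PTLsim internals, just illustrating the reordering such a private
write buffer allows; with the fences commented back in, the "both read
old values" outcome is forbidden):

  #include <atomic>
  #include <cstdio>
  #include <thread>

  std::atomic<int> A{0}, B{0};
  int r1, r2;

  void core0() {
      B.store(1, std::memory_order_relaxed);   // update B
      // std::atomic_thread_fence(std::memory_order_seq_cst);  // would drain the buffer
      r1 = A.load(std::memory_order_relaxed);  // read A
  }

  void core1() {
      A.store(1, std::memory_order_relaxed);   // update A
      // std::atomic_thread_fence(std::memory_order_seq_cst);
      r2 = B.load(std::memory_order_relaxed);  // read B
  }

  int main() {
      std::thread t0(core0), t1(core1);
      t0.join(); t1.join();
      // r1 == 0 && r2 == 0 is the odd outcome: it is allowed as long as a
      // committed store can sit in a local buffer past the following load.
      std::printf("r1=%d r2=%d\n", r1, r2);
  }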

> Fortunately, with a big ROB, the commit can be 50-100 cycles in the future
> after the store first issues. On average, coherence transactions can complete
> within this time in most machines - although it certainly is a scalability
> problem if you have lots of shared data. Many studies of this TM-based
> approach indicate it's indeed faster than a blocking ROB when lots of shared
> data is present (speculative lock elision is one reason - look it up).

I read the lock elision paper, but again I'm not quite sure how this
all fits together. I will probably have to read it again more carefully.

> The situation is more complicated for true load-execute-store atomic
> instructions. <snip>

It is indeed. I haven't yet thought about how the locks (as in the
LOCK prefix and PTLsim's current implementation) and cache coherency
play together. But loads with intent-to-modify, or atomic instructions
containing them, are expected to be fairly expensive anyway, whereas a
normal store under my misunderstood implementation idea would stall
the entire ROB for no good reason.
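
Rough sketch (again my own names, not anything from PTLsim) of why I
expect the locked RMW to be the expensive case: the line has to be
acquired exclusively before the load half, and pinned against probes
until the store half retires:

  #include <cassert>

  enum class LineState { Invalid, Shared, Exclusive, Modified };

  struct CacheLine {
      LineState state = LineState::Invalid;
      bool locked_for_rmw = false;   // line pinned while the atomic executes
  };

  // Load-with-intent-to-modify: only proceeds once we own the line, then pins it.
  void rmw_begin(CacheLine& line) {
      assert(line.state == LineState::Exclusive ||
             line.state == LineState::Modified);  // upgrade/fill already done
      line.locked_for_rmw = true;                 // defer incoming probes
  }

  // Commit of the store half of the atomic: unpin the line.
  void rmw_end(CacheLine& line) {
      line.state = LineState::Modified;
      line.locked_for_rmw = false;
  }

  int main() {
      CacheLine line{LineState::Exclusive, false};
      rmw_begin(line);
      rmw_end(line);
      assert(line.state == LineState::Modified && !line.locked_for_rmw);
  }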

> > Apart from this, I wonder, whether this effort is really necessary, I'm
> > envisioning some kind of poor man's MESI:
> > ...
> The accuracy question is really a philosophical issue - I know many people are
> using PTLsim to measure things like bus contention and power modeling, so
> just measuring delays will not help them. On the other hand, this simplified
> approach is likely to be very difficult to debug and collect statistics - it
> might appear to work, but you'll never be able to inspect the actual miss
> queues and bus traffic to make sure it's accurate.

Yes, but would you be able to inspect those in a real system, to
verify that you're actually accurate in comparison with something
real? No trolling intended!

> Let me take a look at what Avadh has done so far - I think he already
> implemented the various queues and a blocking bus design. I'll post his code
> as soon as he cleans it up a little bit, even though it's not complete yet.

That would be very helpful. Perhaps I could help out there, too. I
will, however, try my modifications, as they're really tiny, and check
whether they give some sane results (meaning big-picture benchmarks).

--Stephan
--
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst at amd.com, Tel.  (AMD: 8-4903)

