[PTLsim-devel] Memory consistency
Stephan Diestelhorst
Wed Oct 24 12:06:13 EDT 2007
> The SMT core only enforces these ordering constraints:
>
> - For atomic x86 instructions, the ld.acq uop globally locks the
> cache line via MemoryInterlockBuffer before loading the value; the
> lock only gets released after the entire atomic instruction
> commits. The microcode for atomic instructions creates a memory
> fence (mf uop) before and after the initial load and final store to
> move unrelated instructions out of the way.
>
> - Any other loads or stores (even from regular non-atomic
> instructions!) in any other threads or cores will be replayed if
> they try to access a locked cache line.
>
> - Total store ordering is enforced because stores are visible to
> all other threads and cores the instant they pass through the
> commit stage. This is a result of the simulation method we use,
> which models the cache state machine separately from the actual
> data words, which are always stored to and loaded from physical
> memory instantly at commit (for stores) or issue (for loads),
> regardless of the state of the cache lines in the model.
>
> These rules happen to work correctly because most real code
> (including Linux and all applications using the standard threading
> libraries) always uses atomic instructions and LFENCE/SFENCE just
> to be safe.
>
> The only possibly problematic case is with the Linux seqlock
> system, which depends on total store ordering. It uses LFENCE to
> ensure any other loads in the pipeline have issued before the
> LFENCE proceeds. However, this currently applies only within a
> single thread context, which is incorrect (and should be fixed when
> we get a chance).
How does that not work in a multi-threaded context? LFENCEs would
operate thread-locally, wouldn't they?
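For reference, here is roughly the seqlock pattern in question (a
simplified sketch in the spirit of Linux's seqlock_t, not the actual
kernel code). It only gives correct results if other threads observe
the writer's seq and data stores in program order:

#include <atomic>

struct Seqlock {
    std::atomic<unsigned> seq{0};  // even: stable, odd: write in progress
    int data = 0;                  // payload; racy by design, the retry
                                   // loop below makes reads safe
};

void write_side(Seqlock& s, int v)
{
    s.seq.fetch_add(1);            // seq becomes odd; must be visible
                                   // before the data store
    s.data = v;
    s.seq.fetch_add(1);            // seq becomes even again
}

int read_side(Seqlock& s)
{
    unsigned s1, s2;
    int v;
    do {
        s1 = s.seq.load();          // the LFENCE on the read side keeps
        v = s.data;                 // this data load from issuing before
        s2 = s.seq.load();          // the first seq load
    } while (s1 != s2 || (s1 & 1)); // retry if a writer intervened
    return v;
}

So the LFENCE itself is indeed thread-local; what the reader
additionally relies on is that the writer's stores become visible to
other threads in order, i.e. total store ordering.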
I have narrowed down my issue with my split-core model to the
following (a sketch of the userspace pattern follows the list):
- two threads use malloc/free, and libc internally guards its memory
regions with futexes
- one thread tries to acquire the futex but fails because another
thread holds it
- the acquiring thread then correctly calls sys_futex with FUTEX_WAIT
and hence blocks the vcpu
- eventually, the holding thread releases the futex, sees that someone
is waiting on it, and calls sys_futex with FUTEX_WAKE
- the only issue is that the waiting thread is never woken up again :(
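For clarity, the userspace pattern involved is roughly the classic
futex lock (a simplified sketch; not glibc's actual code, and the
names are mine):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// futex word: 0 = free, 1 = held, 2 = held with waiters
void futex_lock(std::atomic<int>& f)
{
    int expected = 0;
    if (f.compare_exchange_strong(expected, 1))
        return;                                   // uncontended fast path
    // contended: mark the lock as having waiters, then sleep until
    // the holder wakes us up
    while (f.exchange(2) != 0)
        syscall(SYS_futex, &f, FUTEX_WAIT, 2, nullptr, nullptr, 0);
}

void futex_unlock(std::atomic<int>& f)
{
    if (f.exchange(0) == 2)                       // someone is waiting
        syscall(SYS_futex, &f, FUTEX_WAKE, 1, nullptr, nullptr, 0);
}

Note that FUTEX_WAIT atomically re-checks in the kernel that the futex
word still equals the expected value (2 here) before sleeping; if the
unlocker got there first, the syscall returns immediately. So if both
syscalls are visibly issued, the lost wakeup would seem to point at
the wait/wake delivery path rather than the lock's memory ordering.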
I can see all of the syscalls mentioned above. The SMT model seems to
work whenever I've tried it; with my split-core model, the bug occurs:
all of the syscalls are issued, but for some reason the thread is not
woken up.
I still use a single system-wide MemoryInterlockBuffer, and data is
kept in a single location, which is main memory. The cores are
executed (at cycle granularity) in a round-robin fashion.
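Concretely, the scheduling amounts to something like this (pseudo-code
with made-up names, not the actual sources):

#include <cstdint>
#include <vector>

struct MemoryInterlockBuffer { /* single system-wide line-lock table */ };

struct Core {
    void clock(MemoryInterlockBuffer& ilb) {
        (void)ilb;  // advance this core's pipeline by one cycle;
                    // loads/stores go straight to main memory, line
                    // locks go through the shared interlock buffer
    }
};

void run(std::vector<Core>& cores, MemoryInterlockBuffer& ilb)
{
    for (std::uint64_t cycle = 0; ; ++cycle)  // simulate cycle by cycle
        for (Core& core : cores)              // round-robin: each core
            core.clock(ilb);                  // gets exactly one cycle
}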
Could that be related to the seqlock problem you mentioned? I really
can't think of any problem with the memory consistency: locked atomic
ops should still work, stores stay ordered, and the core-local
ordering constraints are still enforced.
Could it be some problem with interfacing Xen from the split model,
perhaps failing to deliver some interrupts, or with vcpu IDs? I found
that both are coupled to the Context structure, which is fairly
independent of using multiple cores vs. multiple threads on a single
core.
Again... clutching at straws :(
Thanks for any pointers!
Stephan
--
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst at amd.com, Tel. (AMD: 8-4903)