[PTLsim-devel] Memory consistency
Stephan Diestelhorst
Wed Oct 24 12:06:13 EDT 2007
> The SMT core only enforces these ordering constraints:
>
> - For atomic x86 instructions, the ld.acq uop globally locks the
> cache line via MemoryInterlockBuffer before loading the value; the
> lock only gets released after the entire atomic instruction
> commits. The microcode for atomic instructions creates a memory
> fence (mf uop) before and after the initial load and final store to
> move unrelated instructions out of the way.
>
> - Any other loads or stores (even from regular non-atomic
> instructions!) in any other threads or cores will be replayed if
> they try to access a locked cache line.
>
> - Total store ordering is enforced because stores are visible to
> all other threads and cores the instant they pass through the
> commit stage. This is a result of the simulation method we use,
> which models the cache state machine separately from the actual
> data words, which are always stored to and loaded from physical
> memory instantly at commit (for stores) or issue (for loads),
> regardless of the state of the cache lines in the model.
>
> These rules happen to work correctly because most real code
> (including Linux and all applications using the standard threading
> libraries) always uses atomic instructions and LFENCE/SFENCE just
> to be safe.
>
> The only possibly problematic case is with the Linux seqlock
> system, which depends on total store ordering. It uses LFENCE to
> ensure any other loads in the pipeline have issued before the
> LFENCE proceeds. However, this currently applies only within a
> single thread context, which is incorrect (and should be fixed when
> we get a chance).
How does that not work in a multi-threaded context? LFENCEs would
operate thread-locally, wouldn't they?
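For reference, here is roughly the seqlock pattern in question (a
simplified sketch in the spirit of Linux's seqlock_t, not the actual
kernel code). It only gives correct results if other threads observe
the writer's seq and data stores in program order:

#include <atomic>

struct Seqlock {
    std::atomic<unsigned> seq{0};  // even: stable, odd: write in progress
    int data = 0;                  // payload; racy by design, the retry
                                   // loop below makes reads safe
};

void write_side(Seqlock& s, int v)
{
    s.seq.fetch_add(1);            // seq becomes odd; must be visible
                                   // before the data store
    s.data = v;
    s.seq.fetch_add(1);            // seq becomes even again
}

int read_side(Seqlock& s)
{
    unsigned s1, s2;
    int v;
    do {
        s1 = s.seq.load();          // the LFENCE on the read side keeps
        v = s.data;                 // this data load from issuing before
        s2 = s.seq.load();          // the first seq load
    } while (s1 != s2 || (s1 & 1)); // retry if a writer intervened
    return v;
}

So the LFENCE itself is indeed thread-local; what the reader
additionally relies on is that the writer's stores become visible to
other threads in order, i.e. total store ordering.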
I have narrowed down my issue with my split-core model to the
following (a sketch of the userspace pattern follows the list):
- two threads use malloc/free, and libc internally guards its memory
regions with futexes
- one thread tries to acquire the futex but fails because another
thread holds it
- the acquiring thread then correctly calls sys_futex with FUTEX_WAIT
and hence blocks the vcpu
- eventually, the holding thread releases the futex, sees that someone
is waiting on it, and calls sys_futex with FUTEX_WAKE
- the only issue is that the waiting thread is never woken up again :(
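For clarity, the userspace pattern involved is roughly the classic
futex lock (a simplified sketch; not glibc's actual code, and the
names are mine):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// futex word: 0 = free, 1 = held, 2 = held with waiters
void futex_lock(std::atomic<int>& f)
{
    int expected = 0;
    if (f.compare_exchange_strong(expected, 1))
        return;                                   // uncontended fast path
    // contended: mark the lock as having waiters, then sleep until
    // the holder wakes us up
    while (f.exchange(2) != 0)
        syscall(SYS_futex, &f, FUTEX_WAIT, 2, nullptr, nullptr, 0);
}

void futex_unlock(std::atomic<int>& f)
{
    if (f.exchange(0) == 2)                       // someone is waiting
        syscall(SYS_futex, &f, FUTEX_WAKE, 1, nullptr, nullptr, 0);
}

Note that FUTEX_WAIT atomically re-checks in the kernel that the futex
word still equals the expected value (2 here) before sleeping; if the
unlocker got there first, the syscall returns immediately. So if both
syscalls are visibly issued, the lost wakeup would seem to point at
the wait/wake delivery path rather than the lock's memory ordering.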
I can see all of the syscalls mentioned above. The SMT model seems to
work whenever I've tried it; with my split-core model, the bug occurs:
all of the syscalls are issued, but for some reason the thread is not
woken up.
I still use a single system-wide MemoryInterlockBuffer, and data is
kept in a single location, which is main memory. The cores are
executed (at cycle granularity) in a round-robin fashion.
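Concretely, the scheduling amounts to something like this (pseudo-code
with made-up names, not the actual sources):

#include <cstdint>
#include <vector>

struct MemoryInterlockBuffer { /* single system-wide line-lock table */ };

struct Core {
    void clock(MemoryInterlockBuffer& ilb) {
        (void)ilb;  // advance this core's pipeline by one cycle;
                    // loads/stores go straight to main memory, line
                    // locks go through the shared interlock buffer
    }
};

void run(std::vector<Core>& cores, MemoryInterlockBuffer& ilb)
{
    for (std::uint64_t cycle = 0; ; ++cycle)  // simulate cycle by cycle
        for (Core& core : cores)              // round-robin: each core
            core.clock(ilb);                  // gets exactly one cycle
}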
Could that be related to the seqlock problem you mentioned? I really
can't think of any problem with the memory consistency: locked atomic
ops should still work, stores stay ordered, and the core-local
ordering constraints are still enforced.
Could it be some problem with interfacing Xen from the split model,
perhaps failing to deliver some interrupts, or with vcpu IDs? I found
that both are coupled to the Context structure, which is fairly
independent of using multiple cores vs. multiple threads on a single
core.
Again... clutching at straws :(
Thanks for any pointers!
Stephan
--
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst at amd.com, Tel. (AMD: 8-4903)