Darwin/x86

xnu

Aside from things like user-mode ABI changes, the biggest obstacle to running Darwin on a whitebox x86 is the set of changes made to the xnu kernel. Apple has basically removed support for everything that isn't a Core or Core 2 using a modern chipset with things like an HPET. Although most of the hardware support in Darwin lies in kernel extensions, some of it lies in xnu itself. In particular, xnu is responsible for setting up the system timer and enabling PAE and XD on the processor. The xnu kernel can be built without PAE support, but it does not perform runtime checks, meaning that a special non-PAE kernel needs to be built for non-PAE systems.

So, if you're here you're probably wondering how to make xnu boot on older systems. If you've paid attention over at the bootloader page, you'll note that xnu will actually boot as-is on any Core-based system. It was a simple matter of feeding xnu the right data, just like Apple's own boot.efi does. But the bootloader is only the very small beginning of the boot process.

As is common knowledge, several people have been modifying xnu, among other OS X components, to make the entirety of OS X boot on non-Apple machines. These modifications are typically done at the machine-code level by disassembling the Apple binaries and making changes such as replacing portions of code with nop instructions or adding jmp instructions to skip "bad" parts of the code. Some of the modifications even add entirely new sections of code.

Of course, it's much easier, and a lot more instructive, to make the appropriate changes at the C source level. Some later hacks did work this way: for instance, the semthex (against 10.4.8) and DaemonES (against 10.4.9) source releases. The DaemonES release is arguably the better of the two.

So who is DaemonES? Well, a large portion of the code is marked copyright Arekhta Dmitiri and a little bit of googling seems to indicate that Dmitri Arekhta is the man behind the pseudonym. He did some pretty nice work and it's nice that he released the source to it.

Now before we begin modifying xnu, it's helpful to note what tools we have at our disposal.

VMware

The official method of debugging is to use two machines connected by ethernet. Leopard has now added FireWire debugging, which I'm sure is of great help to ethernet driver developers. But the xnu modifications we need to make are localized in the tsc_init and hpet_init functions, which run long before IOKit starts up with ethernet and/or FireWire drivers. These are arguably among the worst-designed parts of xnu. To top it off, these two functions are called just before console init, which means a serial debug line is the only way to see how far the kernel has progressed before it bombed and reset the machine.

But there is a way out. VMware offers a standard GDB remote debug stub on port 8832. You need only turn it on by adding debugStub.listen.guest32 = 1 to your .vmx file. While you're in there you may also want to add rtc.diffFromUtc = 0 since Darwin expects the PC clock to be set to UTC rather than local time.
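
For reference, the two additions to the .vmx file are just:

    debugStub.listen.guest32 = 1
    rtc.diffFromUtc = 0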

Once you've enabled the debug stub in VMware you can run gdb with your compiled kernel file (BUILD/obj/RELEASE_I386/mach_kernel). Boot VMware and let it pause on the boot selector screen. At the gdb command prompt, type target remote localhost:8832. Your VM will immediately stop and GDB is now attached to it. Like any other debugging, you can type cont to continue the program or hit Ctrl+C at any time to stop the program.

Also like any other debugging, you can set breakpoints. However, be aware that breakpoints don't always seem to work correctly. I suspect it has something to do with code selector (CS) changes, though that doesn't explain every case. One thing that does seem to work: if you have the sym/i386/boot.sys file that was used to make your bootloader, you can use nm on it to get the address of its _startprog function and set a breakpoint there. Then you can set a breakpoint on _pstart (in the kernel) and continue. Your _pstart breakpoint should then hit almost immediately. Perhaps it has something to do with the binary code not yet being loaded when you set the breakpoint? Yet it's guaranteed to be loaded by the time you hit _startprog in the booter.
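
Putting that together, a session looks something like this (the _startprog address shown here is made up; use whatever nm prints for your build):

    $ nm sym/i386/boot.sys | grep startprog
    00030e70 T _startprog
    $ gdb BUILD/obj/RELEASE_I386/mach_kernel
    (gdb) target remote localhost:8832
    (gdb) break *0x30e70
    (gdb) break _pstart
    (gdb) cont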

And actually, you don't seem to have to break on _pstart, although doing so seems to refresh the other breakpoints somehow. I still haven't figured this out but the bottom line is that eventually you'll get it to stop and you'll be able to step through the very early kernel startup process which is really quite handy.

Not only that, but this could also be an interesting alternative to two-machine debugging if you are writing a driver that can work in VMware, say a driver for a USB device. In that case, though, nine times out of ten you're better off writing it in user mode, but that's another issue entirely.

So without further ado, let's see what we need to change.

tsc_init

One of the first parts of the bootstrap process is the tsc_init function. The TSC is of course the Time-Stamp Counter found on all Intel chips since the original Pentium. There's really no need to hack xnu to support processors without a TSC since you'd likely never want to run xnu on such a processor. I mean, really, who wants to run on a 486? I sure don't. I don't even have a 486 anywhere and I am known for being a packrat of older machines.

The TSC simply counts upward by one with each processor clock tick. It provides a nice way to see how much time has elapsed: if you know the CPU frequency and the difference between two reads of the TSC, you know exactly how much time has passed. This is important to xnu because xnu is a so-called tickless kernel. That is, rather than maintaining a fixed time-slice, xnu maintains wall-time by basing it on the number of processor cycles that have elapsed, that is, the difference between two reads of the TSC.
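
In code the idea is trivial; here's a minimal sketch (the helper names are mine, not xnu's):

    #include <stdint.h>

    /* Read the 64-bit Time-Stamp Counter. */
    static inline uint64_t rdtsc64(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Convert a TSC delta to nanoseconds, given the CPU frequency in Hz. */
    static inline uint64_t tsc_to_ns(uint64_t delta, uint64_t cpu_hz)
    {
        return delta * 1000000000ULL / cpu_hz;
    }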

Of course, in order to translate from wall-time (i.e. seconds) to processor ticks you need to know how many of those ticks occur in a second, which is quite simply the processor's MHz rating.

Apple, however, seems to "Think Different"(tm) when it comes to determining processor MHz as compared to most other operating systems. It isn't necessarily a bad thing though. Apple's method is based on the assumption that the CPU clock runs at an integer multiple of the bus clock. In the code, this multiplier is referred to as the tscGranularity. This is, of course, the "CPU Multiplier" that anyone with any familiarity with computer hardware already knows well. If you know the bus speed and you know the CPU multiplier, then you know the CPU MHz.

Apple's code starts off with the assumption that the bootloader (which is typically Apple's own boot.efi) has filled in the FSBFrequency property of the "IODeviceTree:/efi/platform" node. That's another discussion entirely, but the quick summary is that the modified boot-132 on my bootloader page currently fills this in as exactly 200 MHz. That's not entirely accurate, since no clock really runs that smoothly, but it's close enough assuming you really do have a 200 MHz bus. I do, since I've got a MacBook Pro, and the other machine I've run xnu on is a P4 3.0 GHz with a 915G chipset; both have a declared 200 MHz FSB. Of course, with the real boot.efi, the FSBFrequency is not quite exactly 200 MHz but actually a tiny bit less in my case.

The next thing that tsc_init does is retrieve the CPU clock multiplier. This is achieved by using the rdmsr instruction with the 0x198 (IA32_PERF_STAT) MSR number. The interesting thing here is that while Intel documents that MSR as existing since the Pentium 4, they only document bits 15:0 as being valid. It is not until the Core 2 processor that Intel documents bits 44:40 as containing the maximum bus ratio. Apple must know something that Intel doesn't document, because they specifically check for either Core or Core 2 processors and then proceed to use those bits from the MSR. Interestingly enough, adding CPU family 0xf to the list of valid platforms also works for me on my P4. Somewhere (not in this particular Intel manual) I have seen that later-model P4s (such as the one I have) do indeed implement this MSR just like Core and Core 2 CPUs.
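
The whole thing boils down to something like this sketch (the names are mine, and unlike Apple's code it guards against a zero multiplier; see below for why that matters):

    #include <stdint.h>

    #define IA32_PERF_STAT 0x198

    static inline uint64_t rdmsr64(uint32_t msr)
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Derive the CPU frequency from the boot-loader-supplied bus frequency.
     * Bits 44:40 of IA32_PERF_STAT hold the maximum bus ratio, documented
     * only for Core 2 but evidently implemented on Core and late P4s too. */
    static uint64_t cpu_freq_from_fsb(uint64_t fsb_hz)
    {
        uint32_t tscGranularity = (rdmsr64(IA32_PERF_STAT) >> 40) & 0x1f;
        if (tscGranularity == 0)
            return 0;   /* caller must measure instead (hello, VMware) */
        /* e.g. a 200 MHz bus with a multiplier of 15 gives 3.0 GHz */
        return fsb_hz * tscGranularity;
    }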

Perhaps even more interesting is that the particular MSR is not necessarily valid unless XE (a.k.a. Intel Dynamic Acceleration) is enabled. I assume that means MSR 38h but I could be wrong. At any rate, Apple's code blindly assumes that the MSR works (i.e. doesn't issue a #GP(0)) and returns valid data. If the MSR returns 0 then all hell breaks loose because the tscGranularity value is then immediately used as the denominator of an integer division. Again, this is not exactly the best written code. It pretty much assumes Core-based processors. Period.

The problem is that when I'm running under VMware I of course have a Core-based processor. But VMware returns 0 for that MSR and doesn't issue #GP(0). So reading the MSR with a recovery trap is not an option. One simply has to make sure that tscGranularity is not 0. If it is zero, then some other method needs to be used.

The aforementioned DaemonES 10.4.9 release improves on Apple's implementation of this function by adding support for a number of other chips; pretty much every recent Intel and AMD chip appears to be supported. However, it does need a little bit of work, and to be honest I did not care for the way he measured the CPU and bus frequency together using the PM timer. It seemed to do only one measurement, and doing this under VMware (where, ironically, it is actually necessary) would seem to require slightly more robust code. Having had a look at the old xnu code that uses the i8254, and at the current Linux code which does the same thing in a much nicer manner, I decided that the current Linux method was by far the best. Ultimately though, it doesn't have to be all that accurate, since it's only determining the nearest integer multiple of the bus speed already provided by the boot loader.
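
For the curious, the i8254 method amounts to timing the TSC against PIT channel 2, whose input clock is a fixed 1.193182 MHz. A rough, single-shot sketch of the idea follows; the real Linux code repeats the measurement and sanity-checks it, and all helper names here are mine:

    #include <stdint.h>

    #define PIT_HZ 1193182u

    static inline void outb(uint16_t port, uint8_t val)
    {
        __asm__ __volatile__ ("outb %0, %1" : : "a" (val), "Nd" (port));
    }

    static inline uint8_t inb(uint16_t port)
    {
        uint8_t val;
        __asm__ __volatile__ ("inb %1, %0" : "=a" (val) : "Nd" (port));
        return val;
    }

    static inline uint64_t rdtsc64(void)        /* as in the earlier sketch */
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Measure the TSC frequency over one ~50 ms PIT channel 2 countdown. */
    static uint64_t measure_tsc_hz(void)
    {
        uint16_t count = PIT_HZ / 20;               /* ~50 ms worth of ticks */

        outb(0x61, (inb(0x61) & ~0x02) | 0x01);     /* gate ch2 on, speaker off */
        outb(0x43, 0xb0);                           /* ch2, lo/hi byte, mode 0 */
        outb(0x42, count & 0xff);
        outb(0x42, count >> 8);

        uint64_t start = rdtsc64();
        while ((inb(0x61) & 0x20) == 0)             /* OUT goes high at zero */
            ;
        return (rdtsc64() - start) * 20;            /* ticks per 50 ms -> Hz */
    }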

The HPET

The HPET is a feature found in modern Intel chipsets; it is also known as the multimedia timer. If you are starting xnu on a modern machine you probably have one, so there is no need to hack the xnu source. Unfortunately, if you are trying to run under VMware you will note that VMware doesn't virtualize the HPET. That's a bummer, because running under VMware is a hell of a lot easier than keeping an entire spare machine around. Besides that, it does not appear to be legal to run Darwin on anything other than a real Mac, due to the need to use some critical kernel extensions from the OS X release. So unless you have a spare Mac lying around to devote to Darwin, you're going to want to use VMware.

The hpet_init function is a very simple bit of code that reads the LPC's RCBA register directly from the PCI bus 0, device 31, function 0 config space. It seems that all ICH chipsets (even back to ICH4, a.k.a. 845) have the LPC at 0:31.0 with the RCBA register at the same config offset. But VMware virtualizes an old PIIX4 (a.k.a. 440BX) which of course has an ISA controller, not an LPC. There is no VMware HPET.

The problem with the xnu code as-is is that it assumes a PCI device exists at 0:31.0 and it assumes that that device is the LPC. To be honest, I am not entirely sure why it does this. This got added around the 8.5 (i.e. 10.4.5) timeframe although you can only know this by disassembling the binary since there are no corresponding source releases. It looks like it was some weird hack to get rid of the need for the old i8254 PIT code which was admittedly hairy.

What I don't get though is that it seems the HPET is only really required by the AppleIntelCPUPowerManagement.kext. There it makes sense because that kext controls the aforementioned Intel Dynamic Acceleration (i.e. the latest incarnation of SpeedStep) which modifies the CPU clock multiplier and thus changes all of those careful calculations done in tsc_init. Of course, if the processor completely naps, then the TSC doesn't increment at all and the HPET is then the source of accurate wall time.

So that explains the need for the HPET, but what I don't get is why Apple wrote it the way they did. Within xnu, the HPET is set up seemingly for the sole purpose of passing its info to the power management extension, and it seems to me that it would have been better to have that kernel extension work the HPET directly. Or perhaps to have everything to do with the HPET localized to the AppleHPET kernel extension and let the CPU power management kext consume that as a provider nub.

On the other hand, perhaps there was some particular design decision to put the HPET/CPU interaction within the kernel, but that doesn't even really make sense either.

At any rate, the fix is simple: don't go looking for the RCBA if the PCI vendor ID of the 0:31.0 device is 0 or 0xffff (i.e. all bits off or all bits on). Of course, that leaves some nVidia chipset users without a working HPET, and it also means that merely loading AppleIntelCPUPowerManagement.kext will crash the kernel. But IMO that's due to the shoddy design of this whole system, and this is the best I'm willing to do for now because anything more requires a pretty big rewrite.
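
What the guarded lookup amounts to is something along these lines. This is a sketch using the classic 0xCF8/0xCFC config mechanism rather than xnu's own PCI accessors; 0xF0 is the config offset of the RCBA in the ICH LPC bridge:

    #include <stdint.h>

    static inline void outl(uint16_t port, uint32_t val)
    {
        __asm__ __volatile__ ("outl %0, %1" : : "a" (val), "Nd" (port));
    }

    static inline uint32_t inl(uint16_t port)
    {
        uint32_t val;
        __asm__ __volatile__ ("inl %1, %0" : "=a" (val) : "Nd" (port));
        return val;
    }

    /* Read a 32-bit register from PCI config space via 0xCF8/0xCFC. */
    static uint32_t pci_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off)
    {
        outl(0xcf8, 0x80000000u | ((uint32_t)bus << 16) |
                    ((uint32_t)dev << 11) | ((uint32_t)fn << 8) | (off & 0xfcu));
        return inl(0xcfc);
    }

    static void hpet_init(void)
    {
        uint16_t vendor = pci_read32(0, 31, 0, 0) & 0xffff;
        if (vendor == 0x0000 || vendor == 0xffff)
            return;                 /* nothing at 0:31.0, e.g. VMware's PIIX4 */

        uint32_t rcba = pci_read32(0, 31, 0, 0xf0) & ~1u;
        /* ... map RCBA and locate the HPET registers from there ... */
        (void)rcba;
    }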

I'm also somewhat confused by the DaemonES source patch, which actually reimplements rdHPET to use the PM timer. Considering that AppleIntelCPUPowerManagement.kext seems to be the only client, that doesn't seem to be required unless you need CPU power management. Of course, in VMware this doesn't apply, so we can just not use the kext.

The "commpage"

The commpage on xnu is a read-only area mapped into each user-mode process that provides various things useful to all processes. One oddity you might encounter is the dsmos_blobs array of pointers. That is the system integrity data, the purpose of which is not entirely clear. Actual protection of binaries occurs because the binaries are encrypted with a key; the dsmos blobs simply contain the now-infamous magic poem. I guess a program can check to make sure it exists and fail if it doesn't? This is not our concern here. If you want more information about the blob you can read Amit Singh's write-up entitled Understanding Apple's Binary Protection in Mac OS X. Note that Mr. Singh does not cover the specific decryption process except to say that the code for that is a function in "Dont Steal Mac OS X.kext".

The commpage setup occurs in "osfmk/i386/commpage/commpage.c". Related code is in that directory. The more interesting parts of the commpage are the routines it provides. For instance, memcpy is not implemented in OS X's equivalent of the C library (libSystem) but is instead located at a fixed address in the commpage. During xnu startup, the kernel checks for various CPU features and provides an appropriate implementation of the function depending on the available CPU features.

For each subarchitecture (32-bit and 64-bit) there is a list of routine descriptors, commpage_32_routines and commpage_64_routines. Each list contains one or more implementations of each routine. Of particular interest is the bcopy routine. The same routine actually has two entrypoints: the first is bcopy, the second is both memcpy and memmove. The general idea is to load the source argument into esi, the destination argument into edi, and the count into ecx, then check for overlapping regions and jump to the appropriate forward or reverse implementation. The memcpy/memmove entrypoint is found exactly 32 bytes later and does the same thing, taking into account that the first two arguments are reversed. In all cases the arguments are checked for overlap; there is no special memcpy entrypoint that skips the check. That means code which inadvertently relies on OS X's memcpy handling overlapping buffers can fail to produce the expected results on other, more typical UNIX systems. Be warned if you port your code from OS X to other platforms carelessly.
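
To make the hazard concrete: the following contrived fragment happens to work on OS X, because the commpage routine detects the overlap and copies backwards, but with an ordinary libc memcpy the result is undefined. memmove is the portable choice.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[16] = "abcdefgh";
        /* Overlapping copy: shift the string one byte to the right.
         * OS X's memcpy detects the overlap; elsewhere this is UB. */
        memcpy(buf + 1, buf, 8);
        printf("%s\n", buf + 1);        /* prints "abcdefgh" on OS X */
        return 0;
    }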

For 32-bit code there are currently three implementations of the bcopy routine. The first is a pure scalar implementation using the obvious movsl loop with a movsb tail. The second is an SSE3 implementation which checks that the length is >= 80 bytes before going with SSE3, otherwise using a normal scalar loop. The third is a so-called SSE4 implementation but is really an SSSE3 (Supplemental SSE3) implementation. Okay, nothing too interesting, except to note that implementations using older SSE instructions could conceivably be provided for older processors.

For 64-bit code the situation is different. There is exactly one bcopy routine, and that routine requires SSSE3 (again labeled as SSE4). An Athlon 64 or a later-model Pentium 4 has 64-bit support but certainly lacks the Supplemental SSE3 found only in the Core 2 Duo and Core-based Xeons; early Athlon 64s might even lack SSE3. Unfortunately, Apple has not seen fit to provide the obvious scalar implementation. This would be a useful patch that ought to be in the Apple-provided xnu tree. My guess is they simply failed to implement it because none of their current models needed it; I doubt this is some sort of anti-hack feature.

If you have a 64-bit processor without SSSE3 (e.g. late-model Pentium 4, Athlon 64), don't care about 64-bit support, and don't want to add the scalar implementation to the kernel source, you can simply add the -legacy flag to the boot command line to turn off 64-bit support.
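
If you're booting with Apple's loader, that means typing -legacy at the boot prompt, or making it permanent via the Kernel Flags entry in /Library/Preferences/SystemConfiguration/com.apple.Boot.plist, something like:

    <key>Kernel Flags</key>
    <string>-legacy</string>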

A proper implementation would include not only a pure scalar implementation but also an SSE3 version. Apple already provides a 32-bit SSE3 implementation so presumably it can be modified to support 64-bit loads and stores. The SSSE3 version is almost identical between 32-bit and 64-bit aside from the use of the 64-bit registers (rsi, rdi) instead of 32-bit registers (esi, edi).

The State of the Kernel

The biggest problem with these xnu modifications is that they still only get us part of the way to a free Darwin. I've gotten xnu to start on my P4, but I can't go beyond that without using some number of binary-only kernel extensions. The licensing for these is questionable: at one time they were freely licensed, but those versions (from the Darwin 8 binary spin) are not ABI-compatible with modern x86 kernels.

What you see here is basically research into how the kernel can be modified to support more processors and chipsets. I currently use a late-model P4, normally running Linux, for my testing on real machines, so there the source needs only very slight modification. But the real testing of Darwin/x86 is done in VMware on a real Mac, and so, other than the tsc_init and hpet_init functions, the source does not require any modification.

Downloads

None yet. This is a work in progress.