

To: Joe NYC who wrote (11212)
Date: 10/3/2000 11:03:55 AM
From: pgerassi
 
Dear Joe:

The problem is that the memory address range would be different for each CPU (in the NUMA plan). The only memory each CPU sees at first is its local memory, which runs from 0 up to the amount of local memory it has. Once the OS boots, non-local memory can be accessed as well. Each local memory area is assigned a unique number starting from 0 (alternatively, each DIMM may carry a physical location tag that is used instead). Thus, to address any byte in total memory (all memory areas combined), you take the address, look up the handle of the memory area that holds that address, subtract that area's base address, and use the result as the offset into that area. Now, all POSIX-compliant OSes let you make use of "shared memory", which is what total memory really is.
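
To make that lookup concrete, here is a minimal C sketch. It is purely illustrative: the table layout and the names mem_area and total_to_local are mine, not anything from an actual OS.

#include <stdio.h>
#include <stdint.h>

/* One entry per local memory area (node), as assigned at boot. */
struct mem_area {
    int      handle;  /* unique number starting from 0 (or a DIMM location tag) */
    uint64_t base;    /* base address of this area within total memory */
    uint64_t size;    /* amount of local memory behind this area */
};

#define NAREAS 2
static struct mem_area areas[NAREAS] = {
    { 0, 0x00000000UL, 0x20000000UL },  /* node 0: 512MB */
    { 1, 0x20000000UL, 0x20000000UL },  /* node 1: 512MB */
};

/* Resolve a total-memory address to (handle, offset): find the area
 * holding the address and subtract that area's base address. */
static int total_to_local(uint64_t addr, int *handle, uint64_t *offset)
{
    int i;
    for (i = 0; i < NAREAS; i++) {
        if (addr >= areas[i].base && addr < areas[i].base + areas[i].size) {
            *handle = areas[i].handle;
            *offset = addr - areas[i].base;
            return 0;
        }
    }
    return -1;  /* address not backed by any area */
}

int main(void)
{
    int h;
    uint64_t off;
    if (total_to_local(0x28000000UL, &h, &off) == 0)
        printf("handle %d, offset 0x%llx\n", h, (unsigned long long)off);
    return 0;
}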

You ask for a range of this shared memory by giving the OS three things: the shared memory handle, the starting address (offset) of the window within it, and the size of the window you want. For example: give me the shared memory area whose handle (id) is 5, starting at 16MB and 4MB in size. The OS then tells you the local address where that window is mapped, say 0x80000000 (just at 2GB), or gives you an error.
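
That (handle, offset, size) triple maps almost directly onto the POSIX shared memory calls. A minimal sketch, assuming a shared memory object already exists; the name "/shm5" just stands in for "handle 5" and is hypothetical (on Linux, link with -lrt):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    /* The "handle": open an existing shared memory object by name. */
    int fd = shm_open("/shm5", O_RDWR, 0);
    if (fd < 0) { perror("shm_open"); return 1; }

    size_t window = 4UL * 1024 * 1024;  /* window size: 4MB */
    off_t  start  = 16L * 1024 * 1024;  /* window offset: starting at 16MB */

    /* Ask the OS to map that window; the OS picks the local address. */
    void *p = mmap(NULL, window, PROT_READ | PROT_WRITE, MAP_SHARED, fd, start);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* p is the local address, e.g. 0x80000000 in the example above. */
    printf("window mapped at %p\n", p);

    munmap(p, window);
    close(fd);
    return 0;
}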

In this way, each CPU keeps its local memory area in the lower part of its address space, just as it does now, while still being able to access a portion of total memory whose addressing scheme is the same for all CPUs. In some 32-bit systems, this local memory is kept below a certain point (typically the first 2GB), and all other memory (including leftover local memory) is accessed through the shared memory mechanism.

BTW, Linus Torvalds and the rest of the kernel development group use this scheme for large-memory (>4GB) x86 machines. Current 2.2.x kernels need to be told at boot time how much memory they have (if it is above 64MB), and this is the easy way to get to the architecture above. The current ballyhoo among NUMA manufacturers is over the method for speeding up access to non-local (shared) memory.
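
For instance, with LILO one would typically append the figure to the kernel command line; a hypothetical /etc/lilo.conf fragment for a 512MB box (the exact amount is of course machine-specific):

image=/boot/vmlinuz-2.2.16
    label=linux
    append="mem=512M"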

When 64-bit hardware is used, another architecture becomes possible. It splits memory into three kinds: local memory, non-local paged memory, and non-local non-paged memory. Paged memory is memory divided into pages of a fixed size (under Linux, typically 4KB), and these pages are cached into local memory. Almost all mainstream OSes already do this to use hard disk space as an extension of physical memory; non-local paged memory simply becomes another storage area for pages, which are moved around on an as-needed basis. NUMA traffic latency thus matters less, because the time to move a whole page is far greater than the setup time (latency) to begin the transfer.

Non-local non-paged memory is memory that cannot be cached (paged) because its contents are volatile (change very rapidly). The shared memory method described above is the best way to handle these segments, as they are typically used for inter-process (thread or CPU) communication. In short: paged memory is updated by at most one CPU at a time, non-paged memory can be updated by many CPUs at a time, and local memory can always be updated by the local CPU at any time. The 64-bit system is needed so that total memory gets a contiguous range and only one number is needed to address any word in it (>4GB).
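
As a sketch of how such a non-paged communication segment might be handled on a POSIX system: map a shared page and pin it with mlock() so it is never paged out. Everything here is illustrative; MAP_ANONYMOUS is a common extension rather than strict POSIX, and a real inter-CPU mailbox would of course live in actual shared memory rather than an anonymous mapping.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define COMM_SIZE 4096  /* one 4KB page, the typical Linux page size */

int main(void)
{
    /* A shared page standing in for an inter-process mailbox. */
    void *comm = mmap(NULL, COMM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (comm == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pin the page: volatile contents must stay put, never be paged out. */
    if (mlock(comm, COMM_SIZE) != 0) { perror("mlock"); return 1; }

    memset(comm, 0, COMM_SIZE);
    printf("pinned mailbox at %p\n", comm);

    munlock(comm, COMM_SIZE);
    munmap(comm, COMM_SIZE);
    return 0;
}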

In any case, NUMA systems can be dealt with using these two universal methods. LDT-based systems seem to cap the maximum data packet at 64 bytes (the size of an Athlon cache line), so known routing methods can be used to move the needed memory (L1 cache lines) to and from the target CPUs. Neat!

Pete