May 19, 2013
Background — High Performance Attached Processors Handicapped By Architecture
The application of high-performance accelerators, notably GPUs and GPGPUs (APUs in AMD terminology), to a variety of computing problems has blossomed over the last decade, resulting in ever more affordable compute power for both horizon and mundane problems, along with growing revenue streams for an expanding industry ecosystem. Adding heat to an already active mix, Intel’s Xeon Phi accelerators, the most recent addition to the accelerator ecosystem, have the potential to speed adoption even further, thanks to hoped-for synergies with the immense universe of x86 code that could potentially run on the Xeon Phi cores.
However, despite any potential synergies, GPUs (I will use this term generically to refer to all forms of these attached accelerators as they currently exist in the market) suffer from a fundamental architectural problem — they are very distant, in terms of latency, from the main scalar system memory and are not part of the coherent memory domain. This in turn has major impacts on performance, cost, design of the GPUs, and the structure of the algorithms:
- Performance — The latency for memory accesses is generally dictated by PCIe latencies, which, while much improved over previous generations, are a factor of 100 or more longer than the latency of coherent cache or local scalar CPU memory. While clever design and programming, such as overlapping and buffering multiple transfers, can hide the latency in a series of transfers, it is difficult to hide the latency for an initial block of data. Even AMD’s integrated APUs, in which the GPU elements are on a common die, do not share a common memory space, and explicit transfers are made in and out of the APU memory.
- Cost — The necessity for additional buffer memory in the GPU to store the working and prefetched data blocks adds cost to the GPU. In addition, the logic to support high-speed data transfer adds costs.
- GPU design — GPU designs have been tending toward increased core count due to the need to operate on ever larger blocks of prefetched data to improve the performance of current algorithms.
- Algorithms — In a reinforcing cycle with current designs, algorithms for GPUs are written on the assumptions that there will be a large transfer latency and that they cannot have coherent access to the main system memory, which in turn entrenches current design practices.
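To make the performance point concrete, here is a minimal back-of-the-envelope sketch of why overlapping transfers hides most, but not all, of the transfer latency. It is a simplified two-stage pipeline model, not a measurement: the function names and the time values (arbitrary units) are assumptions chosen purely for illustration, and it assumes every block has the same transfer and compute time.

```python
def serial_time(n_blocks, transfer, compute):
    """Total time when each block is transferred, then processed, with no overlap."""
    return n_blocks * (transfer + compute)


def overlapped_time(n_blocks, transfer, compute):
    """Total time for a double-buffered pipeline (illustrative model only).

    The transfer of block i+1 overlaps the computation on block i. The very
    first transfer has nothing to overlap with, so it is always exposed;
    after that the slower of the two stages sets the pace, and the final
    compute step drains the pipeline.
    """
    if n_blocks == 0:
        return 0.0
    return transfer + (n_blocks - 1) * max(transfer, compute) + compute


# Hypothetical numbers: 10 blocks, 2 time units to transfer, 5 to compute.
blocks, t_xfer, t_comp = 10, 2.0, 5.0
print("serial:    ", serial_time(blocks, t_xfer, t_comp))      # 70.0
print("overlapped:", overlapped_time(blocks, t_xfer, t_comp))  # 52.0
```

In this model the overlapped run costs the pure compute time (50 units) plus exactly one exposed transfer (2 units), which is the initial-block latency that buffering cannot hide. If the transfer time exceeds the compute time, the transfers dominate the whole run instead.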
Over the last few months, AMD has been quietly rolling out a new technology, heterogeneous Uniform Memory Access (hUMA), to be implemented on future unannounced AMD products. hUMA will fundamentally alter the relationship of the GPU to the main system CPU by providing the GPU with peer access to the main system memory space along with the scalar processor. The key technical features of hUMA will include:
- Unified memory with bidirectional coherency and uniform access — This is the truly transformational feature of hUMA if it works as planned. The GPU and the scalar CPU will now share the same uniform memory space, and both can now allocate and operate on common memory with no need for redundant buffering on the GPU, and no need for explicit transfers.
- Shared coherency domain — hUMA implements a common coherency mechanism, which will open up the potential for multiple GPUs operating on the same shared data and combinations of GPU and CPU in new algorithms, since each will automatically have the latest version of memory served either from cache or through a common cache-miss process. This will certainly change the way GPU-based algorithms are implemented, since developers can now be more flexible in where they perform processing with no penalty for moving data blocks.
- GPU can take page faults and allocate memory — The GPU can take page faults, allowing it to work out of non-locked memory.
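The difference between the two programming models can be sketched with a toy simulation. This is not AMD's API or any real GPU runtime: the class names `DiscreteDevice` and `SharedMemoryDevice`, the `bytes_copied` counter, and the trivial "kernel" are all hypothetical stand-ins, and the sketch only models where copies happen, not coherency or page faults.

```python
def double_in_place(buf):
    """Stand-in for a compute kernel: doubles every byte, modulo 256."""
    for i in range(len(buf)):
        buf[i] = (buf[i] * 2) % 256


class DiscreteDevice:
    """Toy accelerator with private memory: data must be copied in and out."""

    def __init__(self):
        self.bytes_copied = 0

    def run(self, host_data):
        device_buf = bytearray(host_data)        # explicit copy to the device
        self.bytes_copied += len(device_buf)
        double_in_place(device_buf)
        host_data[:] = device_buf                # explicit copy back to the host
        self.bytes_copied += len(device_buf)


class SharedMemoryDevice:
    """Toy accelerator that works directly on the host's memory."""

    def __init__(self):
        self.bytes_copied = 0

    def run(self, host_data):
        # A memoryview refers to the same storage, so nothing is duplicated.
        with memoryview(host_data) as view:
            double_in_place(view)


data_a = bytearray([1, 2, 3, 200])
data_b = bytearray([1, 2, 3, 200])

discrete = DiscreteDevice()
shared = SharedMemoryDevice()
discrete.run(data_a)
shared.run(data_b)

print(list(data_a), discrete.bytes_copied)  # [2, 4, 6, 144] 8
print(list(data_b), shared.bytes_copied)    # [2, 4, 6, 144] 0
```

Both paths produce the same result, but the discrete model moves every byte twice and needs a second buffer, while the shared model touches the data only where it already lives. That redundant buffer and those transfers are what a unified memory space is meant to eliminate.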
The potential impact on CPU/GPU integration is immense. As an embryonic developer I wrote significant code on real-memory systems where a considerable amount of my time was spent juggling overlays of data and code that were locked to physical memory, and the transition to a uniform virtualized memory model transformed programming. While hUMA will still require explicitly exposing parallelism in algorithms, it will remove an unnecessary and inconvenient barrier to its proper exploitation.
Where Are The Products?
Today there are no hUMA products, but with the pacing of AMD’s communications and the expectation of new processor generations from both AMD and Intel, my guess is that we will hear from AMD on the subject of hUMA products before the year is out. Because there is currently also little or no hUMA software IP, we can probably expect that the initial products will be in the server space, targeted at lab and HPC users, along with a push to promote hUMA software IP. Since AMD’s server market share has fallen to less than 5% according to some observers, AMD has little to lose by trying to shake up the server market again. Consider also that AMD has an architecture license for ARM, so there is also the potential to cook up some interesting mixes of very low power scalar ARM cores hUMA-coupled with GPUs, in addition to hUMA-based variations of current AMD APU designs.
What Are The Risks?
As always with new technologies, there are risks. For hUMA as a catalyst for increasing AMD’s market share, the risks are pretty straightforward:
- Performance risks — Unified memory architectures for graphics took a long time to shake out on PCs and still suffer performance limitations compared to dedicated architectures for some applications. Developers persisted due to potential cost advantages, and they now work well enough to be mainstream. hUMA could suffer from performance teething problems, delaying its acceptance. There are also likely to be upper limits on the number of GPU cores that can participate in the cache coherency scheme, and while it is hard to predict, this number will probably be well below the current 500 – 1000 cores of the highest end GPUs.
- Capturing developer mindshare — As I noted above, having gone through a similar technology transformation, it is hard to see that developers will not be all over this one. But, and this is a very big but, when I was developing, nobody paid any attention to the relative market shares of the mainframes we were working on. AMD faces a classic chicken and egg problem — how to attract attention to a transformational technology that will only pay off for some developers if AMD gains market share, and AMD cannot gain market share without the developers in the first place. The likely outcome is incremental acceptance after an early demonstration phase where leading users demonstrate hUMA advantages, and others follow.
- Intel duplicates hUMA — The advantages of hUMA are not a secret, nor are the difficulties engendered by today’s GPU architectures. If Intel is working on a hUMA-equivalent project, even the threat of it could stall acceptance of AMD’s alternative. Currently, Intel is not generating a lot of revenue from its Xeon Phi products, so a change in product architecture might be more embarrassing than expensive if they were to do so quickly. However, regardless of timing, what is certain is that Intel will be forced to react to this move by AMD in some fashion, because the potential advantages of hUMA-based computers are too significant to ignore.
From a developer's perspective, hUMA resonates with overtones of Victor Hugo’s “nothing is stronger than an idea whose time has come” (this is actually a popular paraphrasing, not a literal translation). If you are even peripherally involved in high-performance computing, this is a development that you should be tracking very closely, and when the products become available, get one and start developing — you could be on the forefront of a major sea change in parallel application development.