
Keeping It All in Balance

As processor speeds have rapidly outstripped the ability of memory to keep pace, achieving system balance has been getting more and more difficult. If an application can’t find the information in the CPU’s on-chip cache memory, then a program can come to a screeching halt while waiting hundreds of CPU cycles for data to come back from main memory. With new multicore CPUs, multiple processors will be competing with the graphics processor and the I/O subsystem for access to the memory system. Already overworked, the memory system has to be fast enough to provide “concurrent” access to all those memory clients; otherwise system bottlenecks develop, and somebody has to wait.

Let’s take a look under the hood at how memory works and try to glean some insights into how we can build a better-balanced system. Bear in mind that we can’t achieve perfect balance, given the disparity between processor and memory performance. But a better understanding of the inner workings of common memory architectures can at least help us optimize our systems.

The speed of the DRAM memory cells themselves has increased only slightly over the years, and the column frequency (the rate at which the chip can access bit locations) of mainstream memory is unlikely to exceed 200 MHz before the end of the decade. The diagram below from Micron illustrates the challenge facing memory designers. The red line on the graph plots the relatively slow improvement in column frequency. On the yellow line are peak data rates for each of these devices, quite a dramatic contrast as memory architects come up with nonlinear improvements in the way memory cells are accessed. Since it’s difficult for designers to make the memory cells much faster, higher performance must come from clever architectural improvements to the control logic surrounding the internal memory arrays.

Memory Chip Designers Finally Get Some Respect

The reason for the diverging speed curves is related to the unique nature of DRAM memory architecture and some harsh laws of physics that limit the performance of the capacitive storage elements. While CPU designers optimize for speed, memory chips are primarily designed to be packed together as densely as possible and to provide high manufacturing yield. Faster transistors are larger and burn more power, something CPU designers were willing to live with until recently. Craig Barrett, former CEO of Intel, made the startling admission that CPU designers have finally hit a thermal wall in the quest for faster circuits.

Memory designers have been bumping against this wall for some time, since they haven’t had the luxury of huge heat sinks and fans to keep their chip die temperatures nice and cool. As readers of ExtremeTech know very well, CMOS devices run faster at lower temperatures. While DRAM designers have been able to solve the speed problem by accessing lots of memory banks in parallel, this technique is limited by that pesky thermal wall.

To get around this problem, memory designs have become very sophisticated and include complicated control logic to deliver high data rates. As you’ll see in a moment, the DRAM designers should have been getting respect all along, since the circuits inside a storage cell have an astounding amount of complexity that would send logic designers running to their physics texts.

Pipelining—Not Just for Processors Anymore

Using a technique somewhat like CPU pipelining, memories now prefetch information to keep data flowing at high clock rates. Just as a modern CPU will prefetch multiple cache lines, memory chips prefetch large blocks of data from the relatively slow memory. That way, memory data can be stored in a fast buffer and fed to a screaming-fast pipeline that streams data out of the chip.
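The prefetch trick can be sketched with simple arithmetic. The prefetch widths below are the standard JEDEC values for each memory generation (they aren't given in this article), and the 200 MHz core rate matches the column frequency mentioned earlier:

```python
# Sketch: how prefetch multiplies external data rate over a slow core array.
# Prefetch widths are the standard JEDEC values (DDR = 2, DDR2 = 4);
# the 200 MHz core frequency matches the article's figure.

CORE_MHZ = 200  # column access rate of the internal memory array

def external_data_rate(prefetch_bits_per_pin: int) -> int:
    """Data rate (MT/s per pin) when `prefetch_bits_per_pin` bits are
    fetched from the core at once and streamed out serially."""
    return CORE_MHZ * prefetch_bits_per_pin

for name, prefetch in [("SDR", 1), ("DDR", 2), ("DDR2", 4)]:
    print(f"{name}: core {CORE_MHZ} MHz x {prefetch}n prefetch "
          f"-> {external_data_rate(prefetch)} MT/s per pin")
```

The slow core never speeds up; each generation simply grabs a wider gulp per column access and serializes it onto a faster external bus.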

Paying the Latency Tax

Unfortunately, nothing in life comes for free—something CPU architects discovered as they competed to build ever-longer processor pipelines. As long as a pipeline is moving in a predictable manner, then high speeds are achieved, leading to high data bandwidth. When a pipeline needs to take a new direction or restart, then the delays add latency to the processing. Even though two memory modules may be labeled with the same clock frequency, lower-cost devices will often need extra cycles of latency to operate at those frequencies. To know if you’re getting your money’s worth, you’ll need to go a little deeper than just checking the clock rate.

While memory architecture has gotten more complex, the fundamental organization of a DRAM has not changed much. Basically, DRAM is organized as an array of rows and columns. In modern chips, a single column location on a row actually contains more than one memory cell (bit). The number of cells equals the width of the DRAM’s data bus. For example, the following illustration shows a chip with an 8-bit data bus, so a location at any row/column coordinate holds 8 bits, all of which are accessed simultaneously.

A DRAM may have multiple banks of these memory arrays, as shown in the simplified diagram of an older 32MB chip. The 32MB DRAM is divided into four banks, each having 8,192 rows and 1,024 columns. Multiply everything (8Kx1Kx8x4), and you’ll see that this device is a 256 megabit memory—far smaller than the multi-gigabits in newer chips.
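The capacity arithmetic works out as a quick sketch using the chip's stated geometry:

```python
# Capacity of the article's example chip:
# 4 banks x 8,192 rows x 1,024 columns x 8 bits per column location.

banks, rows, cols, bits_per_col = 4, 8192, 1024, 8

total_bits = banks * rows * cols * bits_per_col
megabits = total_bits // (1024 * 1024)
megabytes = total_bits // (8 * 1024 * 1024)

print(f"{megabits} Mbit = {megabytes} MB")  # 256 Mbit = 32 MB
```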

Controlling Rows and Columns Through RAS and CAS

To understand modern memory architectures, it’s helpful to look back at devices where chip package pins had more-direct control of the memory circuits. A combination of two control signals, the Row Address Strobe (RAS) and Column Address Strobe (CAS), selects a specific memory location (one byte, in our example) for reading or writing. Note that although the diagram shows external pins for these signals, these pins no longer exist; memory chips now produce CAS and RAS signals internally in response to external commands.

Sometimes you’ll hear the total bits in a single row referred to as a page, which is the number of cells that the chip can simultaneously activate for reading or writing. The page size is usually the number of columns multiplied by the width of the data bus. In our diagram, where a row has 1,024 8-bit column locations, a page contains 8,192 bits (1,024 columns per row times 8 bits per column location). When given a RAS signal, a group of sense amplifiers senses the stored value on the tiny capacitors at each bit location in a page. A CAS signal then allows the system to rapidly read the data bits at each column location of the opened page.

A memory cell stores its value as charge on a capacitor. Every bit in a DRAM is a capacitor that can be charged up with some electrons to indicate whether you’ve stored a 1 or a 0. From our physics texts, we know that applying a voltage to a capacitor will cause current to briefly flow and store electrons on one side of the capacitor. Once the voltage is removed, the charge on the capacitor storage element will slowly drain away (due to an effect called leakage). To keep the storage bits from losing their information, every cell in a DRAM must be periodically accessed to refresh its state. Modern memory chips use a self-timed refresh operation that runs in the background.
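Refresh scheduling comes down to simple division. As a hedged sketch: assuming the typical 64 ms retention window used by JEDEC-era DRAMs (a figure not stated in this article) and the 8,192 rows of the example chip, the controller must issue a refresh roughly every 7.8 microseconds:

```python
# Sketch of refresh scheduling arithmetic. The 64 ms retention window
# is the typical JEDEC figure (an assumption, not from the article);
# 8,192 rows matches the example chip described earlier.

RETENTION_MS = 64   # every cell must be refreshed within this window
ROWS = 8192         # rows to refresh per bank

interval_us = RETENTION_MS * 1000 / ROWS
print(f"one refresh command every {interval_us:.2f} us")  # ~7.81 us
```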

The following diagram illustrates a simplified circuit to store and sense charge on a DRAM capacitor. Each capacitor has an associated switch transistor that “opens up” a connection to the capacitor. The Word lines enable all the capacitors in a page. Bit lines connect each cell to a corresponding sense amplifier (not shown) that can detect the value stored in the capacitor. First, the chip sets the bit lines to a reference voltage, a process called precharging. When given a RAS signal, a page of memory bits is “opened”, and all the sense amps detect the small voltage differences caused when the capacitors are switched onto the Bit lines. A full page of data from the sense amps is stored in a fast buffer. This then allows the CAS signal to rapidly read the column locations stored in the opened page.

One of the speed limitations for DRAM cells is that the capacitors can’t be made much smaller without quickly leaking away their charge or becoming susceptible to alpha particles that would cause memory errors. An interesting follow-up topic to this article would examine what happens as the capacitors continue to shrink. Already, designers are starting to grapple with the issue of actually counting the precise number of electrons on a capacitor storage element. How do you share charge with a bit line when the number of stored electrons is approaching one? The world of quantum computing may arrive sooner than we think.

We noted earlier that you no longer find CAS and RAS pins on modern memory chips. As memory architecture has gotten more complex, the number of signals needed to coordinate internal activity has increased. To reduce the number of pins and boost flexibility, memories now connect to the outside world via a command bus. Designers can encode dozens of operations as special commands, rather than building in extra pins to directly control these functions. Memory chips decode the commands into the individual operations internally.

So, for example, the ACTIVATE command now replaces the RAS function to open a new row (or page) of memory. And instead of a CAS signal, the READ command transfers bits from column memory to the data bus. Precharge can be an explicit command, or it can occur automatically to close one page before another is opened. But just to keep things confusing, even though memory chips no longer have dedicated RAS and CAS pins, the latency parameters of newer memory devices are often described in these traditional terms.
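The command flow can be sketched as a tiny model of one bank. Using the DDR 2-2-2-5 timings discussed later in this article (counted in memory clocks), the sketch shows how a page hit skips the ACTIVATE delay while a page miss pays for PRECHARGE as well:

```python
# Hedged sketch of the command flow: a controller must ACTIVATE a row
# before READing its columns, and PRECHARGE before activating a
# different row in the same bank. Timings are the article's DDR
# 2-2-2-5 example, in memory clocks.

tRP, tRCD, tCL = 2, 2, 2  # precharge, activate-to-read, read-to-data

class Bank:
    def __init__(self):
        self.open_row = None  # no page open initially

    def read(self, row):
        """Return the clocks until data appears for a read of `row`."""
        clocks = 0
        if self.open_row is not None and self.open_row != row:
            clocks += tRP          # close the old page (PRECHARGE)
            self.open_row = None
        if self.open_row is None:
            clocks += tRCD         # open the new page (ACTIVATE)
            self.open_row = row
        return clocks + tCL        # column access (READ)

bank = Bank()
print(bank.read(7))   # page empty: tRCD + tCL = 4 clocks
print(bank.read(7))   # page hit:   tCL only   = 2 clocks
print(bank.read(42))  # page miss:  tRP + tRCD + tCL = 6 clocks
```

Real controllers juggle many banks and overlap these delays, but the page-hit/page-miss asymmetry is the heart of it.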

Changing Latency Parameters for a Memory Module

On a memory module itself, in addition to multiple RAM chips, you’ll find a serial presence detect (SPD) chip, a small serial storage device that contains the module’s defaults for timing and other parameters. Naturally, overclocking fiends use the BIOS to plop in their own values, taking advantage of the fact that CMOS runs faster as you increase the voltage. Operating the memory at a higher voltage will usually allow lower latency values, higher clock rates, or both—assuming you don’t fry anything.

As with any overclocking endeavor, there is always the risk of system instability or the smell of melting electronics. But makers of high-performance memories will often test their modules at higher voltages and suggest lower latency settings.

Interpreting the Latency Numbers

There are many DRAM timing parameters used by the memory controller in your PC’s chipset (or integrated into AMD processors), but you can adjust only a few parameters in the system BIOS. For DDR and DDR2 memory, vendors specify four minimum timing parameters, measured in memory clock cycles:

  • CAS Latency (tCL): Column access (READ) until data is available
  • RAS to CAS Delay (tRCD): Row access (ACTIVATE) until CAS (READ)
  • RAS Precharge Delay (tRP): Precharge until row access (ACTIVATE)
  • Precharge Delay (tRAS): Row access (ACTIVATE) until precharge

For instance, a high-performance DDR memory module with 2-2-2-5 timing would have a minimum CAS latency of 2 clocks, a RAS to CAS delay of 2 clocks, a RAS precharge delay of 2 clocks, and a precharge delay of 5 clocks. A high-performance DDR2 module might have 5-5-5-12 timing, illustrating the point that latency (when measured in memory clocks) has actually increased in the latest memory generation.
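To compare those numbers in wall-clock time, convert clocks to nanoseconds. The module clock rates below (DDR-400 at a 200 MHz clock, DDR2-800 at 400 MHz) are assumptions for illustration, not vendor figures:

```python
# Sketch: converting latency clocks to nanoseconds. Clock rates are
# assumed (DDR-400 at 200 MHz, DDR2-800 at 400 MHz); the timing
# numbers are the article's examples.

def clocks_to_ns(clocks: int, clock_mhz: float) -> float:
    return clocks * 1000.0 / clock_mhz

# CAS latency in wall-clock time:
print(clocks_to_ns(2, 200))  # DDR  2-2-2-5  -> 10.0 ns
print(clocks_to_ns(5, 400))  # DDR2 5-5-5-12 -> 12.5 ns
```

Measured in nanoseconds rather than clocks, the latency gap between the generations is much smaller than the raw timing numbers suggest.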

Why the higher latency? Good question, young Padawan. Recall that the actual speed of the memory cells has not increased dramatically, so a memory architecture that clocks twice as fast will likely need twice as many clocks of latency to access the same underlying memory cell. But as the timing diagram illustrates, while latency doesn’t improve, the higher data rate delivers better bandwidth. Once started, a memory burst completes in half the time.

In reality, DDR2 may have more clock cycles of latency, but the clock rates will scale much higher than with DDR because of slightly relaxed timing constraints and improved signal integrity. More importantly, lower voltages and smaller page sizes have cut back on the power consumed by an active page. Lower-power architecture becomes important as DDR2 speeds scale to 800 MHz, even though the underlying memory cell will still run at a measly 200 MHz.
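A quick sketch makes the bandwidth point concrete. Assuming DDR-400 (200 MHz clock) and DDR2-800 (400 MHz clock) modules with the timings above, and recalling that double-data-rate memory moves two data beats per clock, an 8-beat burst finishes sooner on DDR2 despite its higher latency in clocks:

```python
# Sketch of the bandwidth point: once a burst starts, DDR2's faster
# clock moves the same data in half the time. Clock rates are assumed
# (DDR-400 vs DDR2-800); two data beats transfer per clock.

def burst_ns(cas_clocks, beats, clock_mhz):
    clock_ns = 1000.0 / clock_mhz
    data_clocks = beats / 2          # double data rate: 2 beats/clock
    return (cas_clocks + data_clocks) * clock_ns

print(burst_ns(2, 8, 200))  # DDR:  (2 + 4) * 5.0 ns = 30.0 ns
print(burst_ns(5, 8, 400))  # DDR2: (5 + 4) * 2.5 ns = 22.5 ns
```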

System Performance Is What Really Matters

After you’ve done all your research into the detailed specifications for a memory subsystem to match the other fire-breathing components in your new PC, remember that performance gains are very dependent on the applications you run. A higher data rate with longer latency parameters may actually reduce the “overall latency” of getting a block of data. As you can imagine, this is only a benefit if your program needs all those extra bytes of data. What if you really only wanted a single word of data from a new page in memory? In that case, the RAS to CAS delay becomes critical, and CAS latency may also have a notable impact.
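The single-word case can be sketched the same way. Assuming the same DDR-400 and DDR2-800 modules as above (an illustration, not a benchmark), a random read that must close one page and open another pays tRP + tRCD + tCL, with almost no help from the faster data rate:

```python
# Sketch: for a single word from a new page, latency dominates.
# Module clock rates are assumed (DDR-400 at 200 MHz, DDR2-800 at
# 400 MHz); timings are the article's 2-2-2-5 and 5-5-5-12 examples.

def random_word_ns(tRP, tRCD, tCL, clock_mhz):
    return (tRP + tRCD + tCL) * 1000.0 / clock_mhz

print(random_word_ns(2, 2, 2, 200))  # DDR  2-2-2-5:  30.0 ns
print(random_word_ns(5, 5, 5, 400))  # DDR2 5-5-5-12: 37.5 ns
```

Under these assumptions the newer module is actually slower for a lone random access, which is exactly why access patterns decide whether extra bandwidth or lower latency wins.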

In many cases, shaving off a few cycles of latency may only boost system performance by single-digit percentage points. If an application streams large blocks of data to and from contiguous memory, latency won’t have much of an impact. If memory is constantly accessing new DRAM pages, however, prefetching won’t help much, and low latency becomes critical. To really answer the question about performance, stay tuned to ExtremeTech as the high-performance memory modules are put through their paces and compared in real systems.

Armed with a deeper understanding of what goes on in these memory chips, perhaps your next PC purchase will start at the memory aisle in the computer store. Once you’ve found the right memory subsystem, then you’ll be able to search for a CPU and a graphics card that can keep up.
