The Timing of the SuperRAMCard

20 MHz is nicely fast - in fact, almost a little too fast for the memory installed on a SuperRAMCard, but it still works. In this SuperCPU-Corner you can read how, and what it means for the timing.

by Wolfram Sang (Ninja/The Dreams - www.the-dreams.de)
=======================================================

First, let me emphasize that the following information applies to the 20 MHz mode only. Of course, you can access the RamCard in 1 MHz mode as well, but there won't be any timing problems - the memory behaves just as you'd expect from a C64. In turbo mode, however, it is useful to have at least a rough picture of what goes on in the RamCard: knowing about these peculiarities, and taking them into account while planning your programs, will surely have a positive effect on their performance. Now, let's have a look at the structure of a SIMM memory as used on the RamCard.

Precharged?

As CMD mentions in their user's guide to the SCPU, SIMM memory isn't as fast as the internal SRAM on the accelerator card. On the other hand, SIMMs are significantly cheaper and easier to get. The reason for the relative slowness of SIMM memory compared to SRAM is called "precharge": a memory position, or rather the cell it belongs to, has to be prepared before it can be read or written. Now what exactly is a "cell"? On the SIMMs used with a RamCard, a cell always contains four consecutive bytes. These cells are arranged in rows and columns (a kind of grid, see figure).

Register $D27B tells you what type of SIMM you're using. For the sake of the example, let's say we read a 3 - what does that tell us? From table 1 we can see that we have a 12/10 SIMM. The second number is the number of bits used to address the columns (= cells); in our case it's 10 bits, meaning there are 2^10 = 1024 cells in a row. Since one cell always consists of 4 bytes, the storage capacity of a row is 1024 * 4 bytes = 4096 bytes = 4 KB. The first number of the SIMM type tells us that the row addresses are 12 bits long, so we have 2^12 = 4096 rows. Each row holds 4 KB, which makes a total capacity of 4096 * 4 KB = 16384 KB = 16 MB. This calculation is easy to repeat with the other possible values.

=== Table 1

+----------+-------+-----------+-------------------+-----------+--------------+
| Value of | SIMM- | Number of | Length of one row | Number of | Total memory |
|  $D27B   | Type  |   cells   |    (in bytes)     |   rows    |  (in bytes)  |
| (53883)  |       |           |                   |           |              |
|----------+-------+-----------+-------------------+-----------+--------------|
|    0     |  9/9  |  2^9=512  |   512*4=2048=2K   |  2^9=512  | 512*2K=1024K |
|          |       |           |                   |           |    (1 MB)    |
|----------+-------+-----------+-------------------+-----------+--------------|
|    1     | 10/10 | 2^10=1024 |  1024*4=4096=4K   | 2^10=1024 |1024*4K=4096K |
|          |       |           |                   |           |    (4 MB)    |
|----------+-------+-----------+-------------------+-----------+--------------|
|    2     | 11/10 | 2^10=1024 |  1024*4=4096=4K   | 2^11=2048 |2048*4K=8192K |
|          |       |           |                   |           |    (8 MB)    |
|----------+-------+-----------+-------------------+-----------+--------------|
|    3     | 12/10 | 2^10=1024 |  1024*4=4096=4K   | 2^12=4096 |4096*4K=16384K|
|          |       |           |                   |           |   (16 MB)    |
|----------+-------+-----------+-------------------+-----------+--------------|
|    4     | 11/11 | 2^11=2048 |  2048*4=8192=8K   | 2^11=2048 |2048*8K=16384K|
|          |       |           |                   |           |   (16 MB)    |
+----------+-------+-----------+-------------------+-----------+--------------+

===

Well, now we know some more, but will it pay off in any way?
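Before we get to that, a small practical aside: if a program wants to find out for itself which SIMM type is installed, reading the register mentioned above is enough. The following is only a minimal sketch of mine - the label names are made up and an 8-bit accumulator is assumed; only $D27B and the values come straight from table 1:

        LDA $D27B      ; (53883) SIMM type: 0-4, see table 1
        CMP #$04       ; an 11/11 SIMM with 16 MB?
        BEQ IS1111     ; hypothetical label for that case
        CMP #$03       ; a 12/10 SIMM with 16 MB?
        BEQ IS1210     ; hypothetical label for that case
        ...            ; and so on for the smaller types

Of course, you can also just keep the value as an index into a table of your own if you need the geometry later on.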
As far as that question is concerned, let's just say this much: whether we have to change to another cell, or even to another row, is crucial for the timing.

Cycle counting

Let's first examine an easy command like LDA $0400. Its execution takes four cycles: one for the opcode LDA, two to fetch the address $0400, and finally one in which the value is read from memory position $0400. This actual read access is the crucial one: on a SuperRAMCard, it may take up to 8.5 cycles! 8.5 cycles - did I miss something? Oh yes, half cycles do exist. To understand this, you have to imagine the clock signal as a square wave. Each pulse has a rising and a falling edge. Normally, the processor gets active after each rising edge, but after such an ominous half-cycle, it acts after a falling edge. This has no particular consequences, except that the CPU has to be re-synced for certain internal procedures; all of this happens without any action by the programmer.

This makes it obvious that routines with critical timing should avoid the RamCard and prefer the SRAM instead. But since most programs aren't that terribly timing-sensitive, let's take a closer look at how long each kind of access takes on a RamCard - see table 2. You'll notice that memory accesses take the minimum time of 1 cycle as long as you stay in a precharged cell. If you move to neighbouring cells sequentially (without skipping any), there's no delay either. That's because the RamCard has an electronic controller that "assumes" and optimizes this kind of access - very reasonable, as this is the way code is usually executed. There are, however, occasions (branches, for instance) where you have to change to a non-adjacent cell located in the same row. This will cost you 2 cycles, or even 3.5 if you have to change to another row. These values hold for read accesses. Strangely enough, the timing for write accesses is simpler: writing into the current row takes one cycle, and 3 cycles otherwise.

=== Table 2

+----------------------------------------------------+
| Read inside a cell              | 1 Cycle          |
|---------------------------------+------------------|
| Sequential read inside row      | 1 Cycle          |
|---------------------------------+------------------|
| Read from a new cell inside row | 2 Cycles         |
|---------------------------------+------------------|
| Read from a new row             | 3.5 Cycles       |
|---------------------------------+------------------|
| Write inside row                | 1 Cycle          |
|---------------------------------+------------------|
| Write to a new row              | 3 Cycles         |
|---------------------------------+------------------|
| Read during Refresh             | up to 8.5 Cycles |
|---------------------------------+------------------|
| Write during Refresh            | up to 8 Cycles   |
+----------------------------------------------------+

===

As we see, accessing a new row takes the most time. Looking at the two possible types of a 16 MB SIMM, you might conclude that an 11/11 one should be slightly faster than the 12/10 type. That's right: it has fewer and longer rows, which reduces the probability of row changes. However, this speed advantage is likely a very limited one - so prefer the 11/11 if you have a choice, but don't despise a 12/10 one just because of this!
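To get a feeling for what "sequential" means in practice, here is a small sketch of mine - the addresses and the label are made up, the loop itself is assumed to run in the fast SRAM so that only the data reads touch the RamCard, and the cycle values are simply the ones from table 2. It adds up 256 bytes that are stored one after the other at the beginning of the RamCard (the 8-bit sum is just a stand-in for whatever you really want to do with the data):

        LDX #$00        ; index 0-255
        LDA #$00        ; clear the sum (8-bit accumulator assumed)
SUM     CLC
        ADC $020000,X   ; sequential reads - 1 cycle each once the row is open
        INX
        BNE SUM         ; until all 256 bytes have been added up

If the same 256 bytes were scattered over 256 different rows instead, every single read would cost the 3.5 cycles from table 2, and the loop would spend most of its time waiting for precharges - which is exactly why data in the RamCard should be laid out as sequentially as possible.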
Refresh

Unfortunately, we're not done yet - we still have to know how to deal with the so-called refresh. It is vital for the computer: without it, the RAM would forget the data stored in it. In a C64, the refresh is handled by the VIC, and it normally does the job unnoticed in the background. At 20 MHz, however, this activity can't stay hidden; a refresh signal is therefore generated every 10 microseconds - or every 200 cycles - which at worst can prolong a read access to up to 8.5 cycles. The worst case occurs when the CPU wants to read the RAM right after a refresh has started; if we're lucky and the read access comes near the end of the refresh, the delay will be shorter. There's no way to predict the refresh, or to adjust the program timing to it (at the present time, that is). That's not too tragic, as the power of the SCPU is far from being fully exploited at the moment - but it is another reason why routines with critical timing should not be stored in the RamCard.

An example

In order to illustrate the above, let's examine the following code, which does not run in the fast SRAM but in the RamCard. The example is constructed so that the type of the SIMM doesn't matter:

020000 SEP #$30
020002 LDA $020100
020006 STA $030000
...

Before the SEP opcode can be fetched, its cell ($020000 - $020003) has to be precharged. This takes 3.5 cycles, plus another cycle for the operand #$30, which is stored in the same cell; the command therefore takes 4.5 cycles in total instead of 2. The LDA command and its operand bytes are stored sequentially, so only 4 cycles are needed to read these 4 bytes. However, reading the value at $020100, which lies in another cell of the same row, takes 2 cycles according to table 2; the command therefore takes 6 cycles instead of 5 (the accumulator is only 8 bits wide). Then the processor has to change back to the code's cell before it can read the next opcode at $020006, which takes another 2 cycles. The target address of the STA is again stored sequentially and takes 3 cycles to read, and the write to $030000 hits a new row, costing 3 more cycles - so the STA command takes 8 cycles instead of 5. The whole routine thus takes 18.5 cycles (without any refresh), while it could be executed in 12 cycles if it ran in the SRAM.

This topic may seem a bit strange when you first come into contact with it. But once you've got a little used to it, you'll develop a feeling for what is better put into the RamCard, and what isn't suitable for it. I hope it has become clear why data to be put in the RamDisk should be stored as sequentially as possible, and why you should take care to use the fast SRAM wisely.
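As a final illustration - this variant is mine, not something from CMD's documentation - let's look at the same three commands once more, but with the code placed in the accelerator's fast RAM (the address $000900 is only an example) and just the data left in the RamCard:

000900 SEP #$30
000902 LDA $020100
000906 STA $030000
...

Now all opcode and operand fetches come from the fast SRAM and cost the usual 1 cycle each. Only the data access of the LDA (a read from a freshly opened row: 3.5 cycles, even a little more than above, where the instruction fetches had already opened that row) and the write of the STA (a new row: 3 cycles) still pay RamCard penalties. By my own rough count along the lines of table 2, that makes about 16.5 cycles instead of 18.5 - and the more work a routine does per RamCard access, the bigger the win from keeping its code in the SRAM.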