The Timing of the SuperRAMCard

20 MHz is nicely fast - in fact, almost a little too fast for the memory installed on a SuperRAMCard, but it still works. In this SuperCPU-Corner you can read how, and what it means for the timing.

by Wolfram Sang (Ninja/The Dreams - www.the-dreams.de)
=======================================================

First, let me emphasize that the following information applies to the 20 MHz mode only. Of course, you can access the RamCard in 1 MHz mode as well, but there won't be any timing problems - the memory behaves just as you'd expect from a C64. In turbo mode, however, it is useful to have at least a rough picture of what goes on in the RamCard: knowing about these peculiarities, and taking them into account while planning your programs, will surely have a positive effect on their performance. Now, let's have a look at the structure of a SIMM memory as used on the RamCard.

Precharged?

As CMD mentions in their user's guide to the SCPU, SIMM memory isn't as fast as the internal SRAM on the accelerator card. On the other hand, SIMMs are significantly cheaper and easier to get. The reason for the relative slowness of SIMM memory compared to SRAM is called "precharge": a memory position, or rather the cell it belongs to, has to be prepared before it can be read or written. Now what exactly is a "cell"? On the SIMMs used with a RamCard, a cell always contains four consecutive bytes. These cells are arranged in rows and columns (a kind of grid, see figure).

Register $D27B tells you what type of SIMM you're using. For the sake of the example, let's say we read a 3 - what does that tell us? From table 1 we can see that we have a 12/10 SIMM. The second number is the number of bits used to address the columns (= cells); in our case it's 10 bits, meaning there are 2^10 = 1024 cells in a row. Since one cell always consists of 4 bytes, the storage capacity of a row is 1024 * 4 bytes = 4096 bytes = 4 KB. The first number of the SIMM type tells us that the row addresses are 12 bits long, so we have 2^12 = 4096 rows. Each row holds 4 KB, which makes a total capacity of 4096 * 4 KB = 16384 KB = 16 MB. This calculation is easy to repeat with the other possible values.

=== Table 1

+----------+-------+-----------+-------------------+-----------+--------------+
| Value of | SIMM- | Number of | Length of one row | Number of | Total memory |
|  $D27B   | Type  |   cells   |    (in bytes)     |   rows    |  (in bytes)  |
| (53883)  |       |           |                   |           |              |
|----------+-------+-----------+-------------------+-----------+--------------|
|    0     |  9/9  |  2^9=512  |   512*4=2048=2K   |  2^9=512  | 512*2K=1024K |
|          |       |           |                   |           |    (1 MB)    |
|----------+-------+-----------+-------------------+-----------+--------------|
|    1     | 10/10 | 2^10=1024 |  1024*4=4096=4K   | 2^10=1024 |1024*4K=4096K |
|          |       |           |                   |           |    (4 MB)    |
|----------+-------+-----------+-------------------+-----------+--------------|
|    2     | 11/10 | 2^10=1024 |  1024*4=4096=4K   | 2^11=2048 |2048*4K=8192K |
|          |       |           |                   |           |    (8 MB)    |
|----------+-------+-----------+-------------------+-----------+--------------|
|    3     | 12/10 | 2^10=1024 |  1024*4=4096=4K   | 2^12=4096 |4096*4K=16384K|
|          |       |           |                   |           |   (16 MB)    |
|----------+-------+-----------+-------------------+-----------+--------------|
|    4     | 11/11 | 2^11=2048 |  2048*4=8192=8K   | 2^11=2048 |2048*8K=16384K|
|          |       |           |                   |           |   (16 MB)    |
+----------+-------+-----------+-------------------+-----------+--------------+

===

Well, now we know some more, but will it pay off in any way?
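Before we get to that, a small practical aside: if a program wants to find out for itself which SIMM type is installed, reading the register mentioned above is enough. The following is only a minimal sketch of mine - the label names are made up and an 8-bit accumulator is assumed; only $D27B and the values come straight from table 1:

        LDA $D27B      ; (53883) SIMM type: 0-4, see table 1
        CMP #$04       ; an 11/11 SIMM with 16 MB?
        BEQ IS1111     ; hypothetical label for that case
        CMP #$03       ; a 12/10 SIMM with 16 MB?
        BEQ IS1210     ; hypothetical label for that case
        ...            ; and so on for the smaller types

Of course, you can also just keep the value as an index into a table of your own if you need the geometry later on.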
As far as that question is concerned, let's just say this much: whether we have to change to another cell, or even to another row, is crucial for the timing.

Cycle counting

Let's first examine an easy command like LDA $0400. Its execution takes four cycles: one for the opcode LDA, two to fetch the address $0400, and finally one in which the value is read from memory position $0400. This actual read access is the crucial one: on a SuperRAMCard, it may take up to 8.5 cycles! 8.5 cycles - did I miss something? Oh yes, half cycles do exist. To understand this, you have to imagine the clock signal as a square wave. Each pulse has a rising and a falling edge. Normally, the processor gets active after each rising edge, but after such an ominous half-cycle, it acts after a falling edge. This has no particular consequences, except that the CPU has to be re-synced for certain internal procedures; all of this happens without any action by the programmer.

This makes it obvious that routines with critical timing should avoid the RamCard and prefer the SRAM instead. But since most programs aren't that terribly timing-sensitive, let's take a closer look at how long each kind of access takes on a RamCard - see table 2. You'll notice that memory accesses take the minimum time of 1 cycle as long as you stay in a precharged cell. If you move to neighbouring cells sequentially (without skipping any), there's no delay either. That's because the RamCard has an electronic controller that "assumes" and optimizes this kind of access - very reasonable, as this is the way code is usually executed. There are, however, occasions (branches, for instance) where you have to change to a non-adjacent cell located in the same row. This will cost you 2 cycles, or even 3.5 if you have to change to another row. These values hold for read accesses. Strangely enough, the timing for write accesses is simpler: writing into the current row takes one cycle, and 3 cycles otherwise.

=== Table 2

+----------------------------------------------------+
| Read inside a cell              | 1 Cycle          |
|---------------------------------+------------------|
| Sequential read inside row      | 1 Cycle          |
|---------------------------------+------------------|
| Read from a new cell inside row | 2 Cycles         |
|---------------------------------+------------------|
| Read from a new row             | 3.5 Cycles       |
|---------------------------------+------------------|
| Write inside row                | 1 Cycle          |
|---------------------------------+------------------|
| Write to a new row              | 3 Cycles         |
|---------------------------------+------------------|
| Read during Refresh             | up to 8.5 Cycles |
|---------------------------------+------------------|
| Write during Refresh            | up to 8 Cycles   |
+----------------------------------------------------+

===

As we see, accessing a new row takes the most time. Looking at the two possible types of a 16 MB SIMM, you might conclude that an 11/11 one should be slightly faster than the 12/10 type. That's right: it has fewer and longer rows, which reduces the probability of row changes. However, this speed advantage is likely a very limited one - so prefer the 11/11 if you have a choice, but don't despise a 12/10 one just because of this!
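To get a feeling for what "sequential" means in practice, here is a small sketch of mine - the addresses and the label are made up, the loop itself is assumed to run in the fast SRAM so that only the data reads touch the RamCard, and the cycle values are simply the ones from table 2. It adds up 256 bytes that are stored one after the other at the beginning of the RamCard (the 8-bit sum is just a stand-in for whatever you really want to do with the data):

        LDX #$00        ; index 0-255
        LDA #$00        ; clear the sum (8-bit accumulator assumed)
SUM     CLC
        ADC $020000,X   ; sequential reads - 1 cycle each once the row is open
        INX
        BNE SUM         ; until all 256 bytes have been added up

If the same 256 bytes were scattered over 256 different rows instead, every single read would cost the 3.5 cycles from table 2, and the loop would spend most of its time waiting for precharges - which is exactly why data in the RamCard should be laid out as sequentially as possible.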
Refresh

Unfortunately, we're not done yet - we still have to know how to deal with the so-called refresh. It is vital for the computer: without it, the RAM would forget the data stored in it. In a C64, the refresh is handled by the VIC, and it normally does the job unnoticed in the background. At 20 MHz, however, this activity can't stay hidden; a refresh signal is therefore generated every 10 microseconds - or every 200 cycles - which at worst can prolong a read access to up to 8.5 cycles. The worst case occurs when the CPU wants to read the RAM right after a refresh has started; if we're lucky and the read access comes near the end of the refresh, the delay will be shorter. There's no way to predict the refresh, or to adjust the program timing to it (at the present time, that is). That's not too tragic, as the power of the SCPU is far from being fully exploited at the moment - but it is another reason why routines with critical timing should not be stored in the RamCard.

An example

In order to illustrate the above, let's examine the following code, which does not run in the fast SRAM but in the RamCard. The example is constructed so that the type of the SIMM doesn't matter:

020000 SEP #$30
020002 LDA $020100
020006 STA $030000
...

Before the SEP opcode can be fetched, its cell ($020000 - $020003) has to be precharged. This takes 3.5 cycles, plus another cycle for the operand #$30, which is stored in the same cell; the command therefore takes 4.5 cycles in total instead of 2. The LDA command and its operand bytes are stored sequentially, so only 4 cycles are needed to read these 4 bytes. However, reading the value at $020100, which lies in another cell of the same row, takes 2 cycles according to table 2; the command therefore takes 6 cycles instead of 5 (the accumulator is only 8 bits wide). Then the processor has to change back to the code's cell before it can read the next opcode at $020006, which takes another 2 cycles. The target address of the STA is again stored sequentially and takes 3 cycles to read, and the write to $030000 hits a new row, costing 3 more cycles - so the STA command takes 8 cycles instead of 5. The whole routine thus takes 18.5 cycles (without any refresh), while it could be executed in 12 cycles if it ran in the SRAM.

This topic may seem a bit strange when you first come into contact with it. But once you've got a little used to it, you'll develop a feeling for what is better put into the RamCard, and what isn't suitable for it. I hope it has become clear why data to be put in the RamDisk should be stored as sequentially as possible, and why you should take care to use the fast SRAM wisely.
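As a final illustration - this variant is mine, not something from CMD's documentation - let's look at the same three commands once more, but with the code placed in the accelerator's fast RAM (the address $000900 is only an example) and just the data left in the RamCard:

000900 SEP #$30
000902 LDA $020100
000906 STA $030000
...

Now all opcode and operand fetches come from the fast SRAM and cost the usual 1 cycle each. Only the data access of the LDA (a read from a freshly opened row: 3.5 cycles, even a little more than above, where the instruction fetches had already opened that row) and the write of the STA (a new row: 3 cycles) still pay RamCard penalties. By my own rough count along the lines of table 2, that makes about 16.5 cycles instead of 18.5 - and the more work a routine does per RamCard access, the bigger the win from keeping its code in the SRAM.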