DOOM v1.666 running in ZPU simulator. |
So what is the point of doing this?
Well, it is just for fun I guess. A ZPU based game console will have a hard time to beat the next Xbox 720 or PS4. Infact the ZPU is pretty slow and not very well suited for graphics.
Having that said, DOOM on ZPU is interesting from a profiling and verification point. Without any hard evidence I would say DOOM is a more interesting work load than Dhrystone. It also tests the toolchain and hardware quite a bit.
To summarize these are the leanings/status:
The following is currently not supported in the port:
Some add-ons to the Phi platform was needed:
-timedemo demo3
(Again, taken from this. They are running the original DOOM port. The ZPU port is ported from BOOM.)
A typical frame takes about 7400000 cycles with the ZPU platform described above. Stairs, enemies, shooting and transparency are some things that will increase render time. The table below contains a breakdown for frame 200 of the demo run, this frame is good typical frame. The frame takes 9562694 cycles to complete.
(This is just for frame 200 but other frames should be somewhat similar)
There seems to be something fishy here.The majority of time should be spend inside the R_DawSpan and R_DrawColumn functions, these are the functions that does all the pixel drawing. Instead most time is spend doing 64bit arithmetic.
(See http://fabiensanglard.net/doomIphone/doomClassicRenderer.php for some interesting profiling results).
#define FRACBITS 16
__inline__ static fixed_t FixedMul(fixed_t a, fixed_t b)
{
return (fixed_t)((long long) a*b >> FRACBITS);
}
The ZPU does not have support for 64bit hardware multiplication or division so the compiler relies on a software to implement the long long operations. A multiplication takes almost 900 cycles, 400 of those cycles are spent inside two 8 byte memcpy. The division routine has a similar behavior.
There is plenty of room for improvement in these routines. A good approach would be to do a assembler optimized routine. Better but not portable would be to use a dedicated hardware for the fixed functions above. FPGAs usally has a lot of multipliers in them that can be used for the multiplication. Many FPGA vendors supply IPs for dividers, there are also some available on opencores.com
These hardware blocks would then be memory mapped, this would be easier than adding a new instruction to the instruction set. The hardware accelerated multiplier would be very quick the divider a bit slower.
Including function overhead the FixedMul could take less than 50 cycles and the FixedDiv less than 100 cycles, but this is just a very wild guess.
The memcpy complexity is approximently:
cycles_in_memcpy = 200 + bytes_to_copy *3.5;
Any small copy will have a pretty large overhead and large copies aren't that fast either. I am having trouble locating what memcpy being used, the only one I can find is this (zpugcc/toolchain/binutils/libiberty/bcopy.c):
void
bcopy (src, dest, len)
register char *src, *dest;
int len;
{
if (dest < src)
while (len--)
*dest++ = *src++;
else
{
char *lasts = src + (len-1);
char *lastd = dest + (len-1);
while (len--)
*(char *)lastd-- = *(char *)lasts--;
}
}
This is one of the innerloops inside R_Drawcolumn:
do
{
// Re-map color indices from wall texture column
// using a lighting/special effects LUT.
// heightmask is the Tutti-Frutti fix -- killough
*dest = colormap[source[frac>>FRACBITS]];
dest += SCREENWIDTH;
if ((frac += fracstep) >= heightmask)
frac -= heightmask;
}
while (--count);
This loop would be a good candidate for hardware acceleration. The software would not need to wait for the hardware accelerated loop to finish, instead it could go on and do other things.
Each iteration takes about 30 cycles and the count value varies greatly but for this particular frame the average value is about 20. So around 600 cycles could be saved.
The R_drawspan has a similar but not identical inner loop that would also benefit greatly from hardware acceleration.
There is plenty of room for optimizations, software only optimizations would give a nice bump but HW acceleration is needed to reach 35 FPS on a FPGA board clocked at 50 MHz.
A first guess might be that a 3D game would have a pretty characteristic profile when it comes to what instructions being used most. One might suspect lots of multiplications and divisions, but this is not the case.
Scroll down for a complete list of the opcode histogram. This histogram calculated from the entire run not just frame 200.
Here are some comments regarding some of the instructions:
This is a typical example:
count = dc_yh - dc_yl;
32798: 73 loadsp 12
32799: 76 loadsp 24
3279a: 31 sub
3279b: 55 storesp 20
fracstep = dc_iscale;
32532: af im 47
32533: 99 im 25
32534: c8 im -56
32535: 08 load // load from bccc8
00032536 <.LM6>:
frac = dc_texturemid + (dc_yl-centery)*fracstep;
32536: 79 loadsp 36
32537: af im 47
32538: b0 im 48
32539: a4 im 36
3253a: 08 load // load from bd824
3253b: 31 sub
3253c: 71 loadsp 4
3253d: 29 mult
3253e: af im 47
3253f: a0 im 32
32540: 80 im 0
32541: 08 load // load from bd000
32542: 05 add
00032543 <.LM7>:
register const byte *source = dc_source;
32543: af im 47
32544: 99 im 25
32545: c4 im -60
32546: 08 load // load from bccc4
All global variables are located close in memory but they are all fetched using absolute addresses. IMs are used to describe the absolute addresses. In fact almost all arrays and global variables are loaded using 3 im to address them. Even if most global variable are close together
It would be nice to have something like a frame pointer and then a relative load.
im 47 // set fp to at bccc4
im 25
im -60
In the above example 3 instructions would be saved using a loadfprel instruction.
(To clarify, there is no such function. But it might be a good addition if there ever is a ZPUv2)
Opcode histogram for all frames
Well, it is just for fun I guess. A ZPU based game console will have a hard time to beat the next Xbox 720 or PS4. Infact the ZPU is pretty slow and not very well suited for graphics.
Having that said, DOOM on ZPU is interesting from a profiling and verification point. Without any hard evidence I would say DOOM is a more interesting work load than Dhrystone. It also tests the toolchain and hardware quite a bit.
Mr DOOM in trouble |
- With HW acceleration it is possible to play DOOM on FPGA
- long long operations and memcpy are not very great with the current toolchain
- Port is running modified Phi platform
- This is only tested in emulator
The following is currently not supported in the port:
- Sound
- Input
- Network
Some add-ons to the Phi platform was needed:
- Mode 13 frame buffer
- Doom1.wad preloaded into memory
- 35Hz timer
Performance
So how fast is DOOM on ZPU?
With the following setup:
- All instructions implemented
- 225 MHz
- Loads takes 2 cycles (i.e no RAM latency)
At 225 MHz an average 30.5 FPS with some rare 20-25 FPS are achieved when running 'timedemo demo3'. Resolution is the default Doom, 320x200.
To reach the same performance on a PC something like a 486 @ 66MHz and a decent graphics card is needed.
So is the ZPU about 1/4 the speed of a 486?
No, the comparison is a bit unfair:
No, the comparison is a bit unfair:
- In simulation the frame buffer has no latency, this is NOT true for computers from the 486 era. A good graphics card could almost double FPS.
- ZPU port does not have any sound.
- RAM access has 1 cycle latency in simulation, this is far from true for an old PC.
It would be hard to reach >100 MHz on a FPGA, as far as I know all ZPU implementations barley reach 100 MHz. But with some tricks it would still be possible to achieve decent performance (see profiling for details):
- Some of the inner loops and fixed point arithmetic could be replaced by hardware units.
- There is a frame buffer memcpy done every frame, this could be removed if a double buffering capable display controller is used.
- The 64 kB frame buffer could reside in a dualport onchip RAM. This would give a very fast frame buffer.
Profiling
The DOOM demo mode is very similar to real game play, the demo mode is normal game play but with key strokes recorded. Demo runs will be deterministic. The profiling has been run using the command line:-timedemo demo3
(Again, taken from this. They are running the original DOOM port. The ZPU port is ported from BOOM.)
Time spent in each frame as FPS. The workload varies much between frames. |
Frame 1137, 76 FPS. Fastest frame. No time spent in ceiling or floor. Little time spent in drawing walls. |
Frame index 184, 14 FPS. Slowest frame. Way more fixed point multiplications than surrounding frames, not sure why. |
A typical frame takes about 7400000 cycles with the ZPU platform described above. Stairs, enemies, shooting and transparency are some things that will increase render time. The table below contains a breakdown for frame 200 of the demo run, this frame is good typical frame. The frame takes 9562694 cycles to complete.
Function Name | Total cycles spent | Nbr calls | Avg time in call | Description |
---|---|---|---|---|
memcpy (all) | 1414058 | 5356 | 263 | All memcpy calls |
memcpy (framebuffer) | 224504 | 1 | 224504 | memcpy to framebuffer |
udivsi3 | 2753956 | 1578 | 1745 | long long div |
muldi3 | 2073756 | 2322 | 893 | long long mult |
R_Drawspan | 837281 | 161 | 5200 | Draw ceiling/floor |
R_DrawColumn | 647947 | 867 | 747 | Draw walls |
There seems to be something fishy here.The majority of time should be spend inside the R_DawSpan and R_DrawColumn functions, these are the functions that does all the pixel drawing. Instead most time is spend doing 64bit arithmetic.
(See http://fabiensanglard.net/doomIphone/doomClassicRenderer.php for some interesting profiling results).
Frame 200. Used for function call profiling. |
udivsi3 & muldi3
DOOM uses 64 bit multiplications as part of their fixed point arithmetic. Many old PCs lacked an FPU I believe this is the reason why DOOM is integer only.#define FRACBITS 16
{
return (fixed_t)((long long) a*b >> FRACBITS);
}
__inline__ static fixed_t FixedDiv(fixed_t a, fixed_t b)
{
return (abs(a)>>14) >= abs(b) ? (a^b)<0 ? MININT : MAXINT :
(fixed_t)(((long long) a << FRACBITS) / b);
}
There is plenty of room for improvement in these routines. A good approach would be to do a assembler optimized routine. Better but not portable would be to use a dedicated hardware for the fixed functions above. FPGAs usally has a lot of multipliers in them that can be used for the multiplication. Many FPGA vendors supply IPs for dividers, there are also some available on opencores.com
These hardware blocks would then be memory mapped, this would be easier than adding a new instruction to the instruction set. The hardware accelerated multiplier would be very quick the divider a bit slower.
Including function overhead the FixedMul could take less than 50 cycles and the FixedDiv less than 100 cycles, but this is just a very wild guess.
memcpy
If the Fixed functions are improved then the majority of the time spend in memcpy will disappear.The memcpy complexity is approximently:
cycles_in_memcpy = 200 + bytes_to_copy *3.5;
Any small copy will have a pretty large overhead and large copies aren't that fast either. I am having trouble locating what memcpy being used, the only one I can find is this (zpugcc/toolchain/binutils/libiberty/bcopy.c):
void
bcopy (src, dest, len)
register char *src, *dest;
int len;
{
if (dest < src)
while (len--)
*dest++ = *src++;
else
{
char *lasts = src + (len-1);
char *lastd = dest + (len-1);
while (len--)
*(char *)lastd-- = *(char *)lasts--;
}
}
When looking at the assembler the doesn't look very similar, however GCC might unroll the loops.
The frame buffer memcpy takes about 2-3% of a frame, this could be removed completely by using hardware double buffering.
I have seen several post about optimized memcpy for ZPU, has anyone of these reached the repository?
R_DrawColumn & R_DrawSpan
These are the functions that most time should be spent into. These functions draw the pixel values to the framebuffer (via a buffer). R_Drawcolumn draws the walls and R_Drawspan draws the floors and ceiling, (see http://fabiensanglard.net/doomIphone/doomClassicRenderer.php for a good explanation)This is one of the innerloops inside R_Drawcolumn:
do
{
// Re-map color indices from wall texture column
// using a lighting/special effects LUT.
// heightmask is the Tutti-Frutti fix -- killough
*dest = colormap[source[frac>>FRACBITS]];
dest += SCREENWIDTH;
if ((frac += fracstep) >= heightmask)
frac -= heightmask;
}
while (--count);
Each iteration takes about 30 cycles and the count value varies greatly but for this particular frame the average value is about 20. So around 600 cycles could be saved.
The R_drawspan has a similar but not identical inner loop that would also benefit greatly from hardware acceleration.
There is plenty of room for optimizations, software only optimizations would give a nice bump but HW acceleration is needed to reach 35 FPS on a FPGA board clocked at 50 MHz.
OP code histogram
The emulator has a nice opcode histogram feature, this is useful to get an understanding of the workloads.A first guess might be that a 3D game would have a pretty characteristic profile when it comes to what instructions being used most. One might suspect lots of multiplications and divisions, but this is not the case.
Scroll down for a complete list of the opcode histogram. This histogram calculated from the entire run not just frame 200.
Here are some comments regarding some of the instructions:
loadsp / storesp / addsp
The most common instructions are what I call the stack operation instructions these are loadsp, storesp and addsp. I use to think of them as a necessity due to the lack of registers, or they being the equivalent to registers in a register based architecture. In most workloads these will always be among the most common instructions.This is a typical example:
count = dc_yh - dc_yl;
32798: 73 loadsp 12
32799: 76 loadsp 24
3279a: 31 sub
3279b: 55 storesp 20
The register based architecture would have something like:
sub r20, r24,r12 // substract r12 from r24 and store in r20
load
32bit load is a very common instruction at 12.5%. 32bit loads are used for accessing arrays and global variables. There are many global variables in the code. These are often used in combination with the im instruction:fracstep = dc_iscale;
32532: af im 47
32533: 99 im 25
32534: c8 im -56
32535: 08 load // load from bccc8
00032536 <.LM6>:
frac = dc_texturemid + (dc_yl-centery)*fracstep;
32536: 79 loadsp 36
32537: af im 47
32538: b0 im 48
32539: a4 im 36
3253a: 08 load // load from bd824
3253b: 31 sub
3253c: 71 loadsp 4
3253d: 29 mult
3253e: af im 47
3253f: a0 im 32
32540: 80 im 0
32541: 08 load // load from bd000
32542: 05 add
00032543 <.LM7>:
register const byte *source = dc_source;
32543: af im 47
32544: 99 im 25
32545: c4 im -60
32546: 08 load // load from bccc4
It would be nice to have something like a frame pointer and then a relative load.
im 47 // set fp to at bccc4
im 25
im -60
popfp
fracstep = dc_iscale;
im -1
loadfprel // load from bccc8
frac = dc_texturemid + (dc_yl-centery)*fracstep;
loadsp 36
im 5
im 80
loadfprel // load from bd824
sub
loadsp 4
mult
im 1
im 79
loadfprel // load from bd000
register const byte *source = dc_source;
im 0
loadfprel // load from bccc4
im -1
loadfprel // load from bccc8
frac = dc_texturemid + (dc_yl-centery)*fracstep;
loadsp 36
im 5
im 80
loadfprel // load from bd824
sub
loadsp 4
mult
im 1
im 79
loadfprel // load from bd000
register const byte *source = dc_source;
im 0
loadfprel // load from bccc4
(To clarify, there is no such function. But it might be a good addition if there ever is a ZPUv2)
loadb
Most pixel operations are done using char*, these will in most cases use loadb. It is surprising to see that loadb is only 1.7% given that this instruction is executed a lot in the inner loops of R_Drawxxxx functions.Opcode histogram for all frames
im | 32.99% |
storesp | 12.77% |
load | 12.49% |
loadsp | 8.39% |
add | 9.35% |
addsp | 4.22% |
neqbranch | 3.17% |
store | 2.79% |
loadb | 1.76% |
eq | 1.39% |
and | 1.28% |
storeb | 1.16% |
lshiftright | 0.98% |
ashiftright | 0.61% |
not | 0.56% |
sub | 0.54% |
lessthanorequal | 0.53% |
mult | 0.51% |
poppcrel | 0.51% |
lessthan | 0.50% |
nop | 0.47% |
pushspadd | 0.36% |
ulessthanorequal | 0.35% |
ulessthan | 0.35% |
popsp | 0.33% |
or | 0.33% |
poppc | 0.21% |
ashiftleft | 0.16% |
loadh | 0.15% |
callpcrel | 0.15% |
pushsp | 0.06% |
call | 0.05% |
storeh | 0.05% |
flip | 0.02% |
xor | 0.01% |
neg | 0.009% |
What is next?
It would be nice to port this to FPGA. Preferably to an already existing enviroment. Maybe the ZPUIno?
Some things that the platform needs to handle:
- Code size is ~700kB
- File data is ~4 MB
- 4MB RAM (either onchip or offchip+cache)
- Display output (VGA/HDMI, preferably 6bits per plane)
- FPGA resources for a Mode 13 frame buffer