Fullscreen 2 (DRAFT 4)
Matthew Allum
Opened Hand Ltd
mallum@openedhand.com
2005 OpenedHand Ltd
Introduction

This report builds on the original fullscreen blit benchmark tests on handheld ARM-based devices. The focus moves to font glyph rendering speeds via different mechanisms, image blitting via GDK, and rerunning the original tests on a newer 2.6 kernel.

Graphics output is assumed to be by means of writing data to a 'dumb' kernel framebuffer device, either directly or via an X server. All glyph rendering is done via an 8 bit mask rather than a sub-pixel 8888 mask. This option is set via the fontconfig local.conf or by not having the X server advertise the LCD pixel order.
Tests

Simple test programs, written in C, were created for the tests. The original tests were kept and new tests were added; the full set is as follows:

test-fb
    Performs blits directly to the raw framebuffer device (no X). From the original tests.

test-x
    Performs blits to an X window via MIT-SHM shared memory XImages. From the original tests.

test-gdk
    Performs blits via GDK-pixbufs on X. Blits are performed to a GTK drawing area widget with double buffering turned off. This makes the test comparable to the others, as they perform no double buffering. A test-gdk-idle was also later created with double buffering turned on.

test-freetype
    Renders lines of glyphs to the framebuffer using the freetype library. The original version generated each glyph's bitmap on every blit; an improved version was then created which pregenerated ('cached') the glyph bit masks.

test-xft
    Renders lines of glyphs to an X window using the Xft2 library.

test-pango
    Renders lines of glyphs to an X window using the Pango-Xft library. No pango layout or GTK functionality is used.

test-pango-layout
    Renders lines of glyphs to a GTK drawing area (with double buffering disabled) via Pango layouts. GTK/GDK must be used, as only pango 1.8 and later expose layout functionality to 'raw Xft'. One layout per line is used.

Note that all font based tests take similar arguments to specify what text is rendered (run the tests with -h to see them). By default the Vera Sans font is used at 18 points, with 20 lines of the ASCII alphabet ('a' through to 'z') being rendered 200 times.
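As a rough illustration of how the default font-test workload relates to a glyphs/sec figure, the following sketch computes the glyph count for one run and the resulting rate. The function names are hypothetical and are not taken from the test sources, and the sketch assumes each line holds exactly the 26 letters 'a' to 'z'.

```c
#include <assert.h>

/* Total glyphs drawn in one run:
 * lines per pass * glyphs per line * number of passes. */
static unsigned long total_glyphs(unsigned lines, unsigned glyphs_per_line,
                                  unsigned repeats)
{
    return (unsigned long)lines * glyphs_per_line * repeats;
}

/* Rendering rate in glyphs per second, given the elapsed
 * wall-clock time of the run in milliseconds. */
static double glyphs_per_sec(unsigned long glyphs, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return (double)glyphs * 1000.0 / elapsed_ms;
}
```

Under these assumptions the defaults give total_glyphs(20, 26, 200) = 104000 glyphs per run; dividing by the measured elapsed time yields the rate.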
Test Platforms

The tests were run on the following platforms:

Sharp Zaurus c760 (Husky)
    CPU: XScale-PXA255 rev 6
    RAM: 64MB
    Display: 640x480x16 LCD
    GFX Chip: ATI IMAGEON W100
    X11: Freedesktop.org kdrive Xfbdev server
    Kernel: 2.6.11-rc2-openzaurus (softfloat)

Ipaq 5500
    CPU: XScale-PXA255 rev 6
    RAM: 128MB
    Display: 320x240x16 LCD
    GFX Chip: MediaQ
    X11: Freedesktop.org kdrive Xfbdev server
    Kernel: 2.4.19-rmk6-pxa1-hh37

Ipaq 3850
    CPU: StrongARM-1110 rev 9
    RAM: 128MB
    Display: 320x240x16 LCD
    GFX Chip: None
    X11: Freedesktop.org kdrive Xfbdev server
    Kernel: 2.4.19-rmk6-pxa1-hh37

IBM Thinkpad T40p
    CPU: x86 Pentium M 1600MHz
    RAM: 1GB
    Display: 1400x1050x16 LCD
    GFX Chip: ATI Radeon
    X11: XFree86 4.3
    Kernel: 2.6.9
Platform Notes

All ARM machines have the same version of the X server and X libraries, both of which are from recent checkouts of the freedesktop.org CVS kdrive source. In all of the ARM cases above no hardware acceleration was used. The display is also running in its 'natural' orientation.

The c760 is very similar in hardware to the c700, except for a larger battery and more internal flash storage.

The binaries built on the c760 use the soft-float floating point emulation provided by newer versions of gcc. This reportedly performs much better than kernel 'hardfloat' floating point.

The Thinkpad is x86 hardware and has an accelerated XFree86 server.
Benchmark Results
Blit Results

Test Results

Device          test-fb         test-x          test-gdk        test-gdk-idle
c760            12177 KB/sec    11015 KB/sec    6163 KB/sec     Not Run
Ipaq 5550       7425 KB/sec     6412 KB/sec     5184 KB/sec     Not Run
Ipaq 3800       30241 KB/sec    23547 KB/sec    11144 KB/sec    10885 KB/sec
Thinkpad T40p   137896 KB/sec   370451 KB/sec   317215 KB/sec   Not Run
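To put the KB/sec figures in more familiar terms, the sketch below converts a blit throughput into full frames per second for a given display size. This is an illustration only: the function names are hypothetical, it assumes 1 KB = 1024 bytes (the report does not say which definition the tests use), and it assumes each blit covers the whole screen.

```c
#include <assert.h>

/* Bytes in one full frame: width * height * bytes per pixel. */
static unsigned long frame_bytes(unsigned width, unsigned height,
                                 unsigned bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);
    return (unsigned long)width * height * (bits_per_pixel / 8);
}

/* Full frames per second for a throughput given in KB/sec.
 * Assumption: 1 KB = 1024 bytes. */
static double frames_per_sec(double kb_per_sec, unsigned long bytes_per_frame)
{
    assert(bytes_per_frame > 0);
    return kb_per_sec * 1024.0 / (double)bytes_per_frame;
}
```

Under these assumptions a 640x480x16 frame is 614400 bytes (600 KB), so the c760's 12177 KB/sec on test-fb corresponds to roughly 20 full frames per second, while a 320x240x16 frame is 150 KB, so the Ipaq 3800's 30241 KB/sec corresponds to roughly 200.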
Blit Discussion

We see no marked improvement in blit speeds since the previous tests; results are much the same. This is to be expected, as no major developments have happened in this area since the tests were last run.

The c760, however, is using a 2.6 kernel and performance has actually degraded. This is not too much of a worry: the 2.6 kernel on the c760 is very immature and the performance degradation has been reported to the fb driver author. The fb driver is in fact a rewrite of the 2.4 driver, written without access to the display chip's technical details.

The 5500 results are very odd. It seems actual framebuffer access is slow during heavy blits, yet font rendering was very fast in comparison. The fb driver lacks any of the acceleration functionality provided by the MediaQ chip. Could the driver or hardware impose some kind of bottleneck under heavy load that is causing strange results? The same results appeared even after a second, separate run of the benchmarks.

The 3800 is the fastest of all the ARM devices with direct access to the display. It has no graphics chip driver, and the Linux support for the hardware is very mature compared to the other two devices. Its CPU, however, is the slowest.

GDK pixbuf blits take a further large speed hit over pure X MIT-SHM blits. A reason for this could be the pixbuf internals having the extra work of rounding down from 24bpp RGB to 16bpp RGB before blitting to the server. Interestingly, this difference is not as large when run on an x86 system. Could there perhaps be a more serious issue with GTK on ARM? This needs further investigation. Version 2.4 of GTK was used for the tests, which apparently does not suffer from the previously reported SHM bug.

The GTK blit test disabled the internal double buffering on the drawing area widget (via gtk_widget_set_double_buffered(widget, FALSE)) to make the test similar to the other fullscreen blit tests, which use no double buffering.
GTK double buffering works in such a way that the widget's visible window is replaced with an off-screen pixmap before its expose() handler is called; on returning from this handler the pixmap is copied to the visible window. To accomplish a similar test with double buffering, the blit must happen elsewhere in the code so that the double buffering expose mechanism can still take place. It was therefore placed in an idle handler which, after blitting, would trigger the expose handler. Such a test was created (test-gdk-idle) and the results, from the Ipaq 3800, were just slightly worse. Any performance loss is likely due to the frequency with which the idle handler gets called. This assumes the cost of copying the pixmap from off screen to on screen is offset by the time saved blitting to an off-screen pixmap.

On x86, test-x is roughly 2.7 times faster than test-fb (370451 against 137896 KB/sec); this is the effect of having an accelerated server.
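The 24bpp to 16bpp conversion suggested above as a possible extra cost in the GDK pixbuf path can be sketched as follows. This is a minimal illustration and not GDK's actual code: it assumes the 16bpp display uses the common RGB565 layout (the report only says '16bpp RGB'), it truncates low bits rather than dithering, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one 8-bit-per-channel pixel into RGB565 by dropping the
 * low bits of each channel (5 bits red, 6 green, 5 blue). */
static uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert a packed 24bpp RGB buffer (3 bytes per pixel, R first)
 * into a 16bpp buffer. This per-pixel loop is the extra work a
 * 24bpp source must pay before every blit to a 16bpp display. */
static void convert_rgb888_to_rgb565(const uint8_t *src, uint16_t *dst,
                                     size_t npixels)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < npixels; i++)
        dst[i] = pack_rgb565(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
}
```

A source image already held in the display's 16bpp format would skip this loop entirely, which is one way the MIT-SHM test could be avoiding the cost.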
Glyph Results

Test Results (glyphs/sec)

Device          test-freetype   test-freetype-cached   test-xft   test-pango   test-pango-layout
c760            1156            Not Run                9386       6712         Not Run
Ipaq 5550       1711            Not Run                18991      12192        5823
Ipaq 3800       957             25304                  17937      11458        6778
Thinkpad T40p   28904           28812                  16634      15384        15298
Glyph Discussion

With pregenerated glyphs freetype is fastest, then Xft. The plain pango line rendering is approximately 30% slower, with pango layout rendering being approximately a further 30-50% slower. Although absolute speeds vary between platforms, the relative difference in speed between test types stays approximately the same (though this is less true on the Thinkpad).

The Thinkpad results, though fast, are slower than expected when compared to blit speeds on both framebuffer and X. I am not sure why this is.

The non-cached freetype test is much slower than expected on ARM platforms. On a desktop x86 system the results are much improved, with speeds greater than those of Xft, as expected. The reason for the low performance on ARM is likely that the glyph bitmap is regenerated on every glyph render, with no caching, and that bitmap generation uses a lot of floating point. This shows that Xft caches generated glyph bitmaps and that such caching is definitely required for acceptable performance. To improve on this, a version of test-freetype (test-freetype-cached.c) was created that pregenerated glyph bitmaps in a simple cache before painting them. Running on the Ipaq 3800 gave much improved performance (25304 against 957 glyphs/sec) and an initial 'cache generation' time of 1159 ms.

It should also be noted that the test-freetype test very crudely renders just the 8 bit mask to the display (every mask value > 0 is blitted). No sub-pixel or even basic anti-aliasing was performed.

test-pango writes text via the low level pango Xft calls to render lines of text to an X window. No GDK/GTK calls are used. To investigate the overhead of rendering to a GTK widget and window, two further tests were created: test-pango-gdk, rendering to a GDK window, and test_pango_gtk, rendering to a GTK drawing area. Benchmarks from these on the 3800 were approximately equal. Another test was created using gdk_draw_glyphs() instead of pango_xft_render(); again results were comparable, suggesting gdk_draw_glyphs() is just a wrapper around pango_xft_render().
test-pango-layout uses the pango layout API to render onto a GTK drawing area; most GTK widgets use layouts. There is an overhead involved, and this could be worse if we were rendering more than just a simple line. However, one would expect a performance improvement if a single layout were used for all text rather than a layout per line.
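The glyph caching and crude mask blit described in this discussion can be sketched as below. This is not the code from test-freetype-cached.c: the types and function names are hypothetical, the cache is assumed to hold one slot per ASCII code, the framebuffer is assumed to be 16bpp, and demo_rasterize() is a stand-in for whatever freetype calls produce the real 8 bit mask.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 128  /* assumption: one slot per ASCII code */

typedef struct {
    int width, height;
    uint8_t *mask;       /* width * height bytes, one value per pixel */
} Glyph;

/* Hypothetical rasterizer hook: fills *out, returns 0 on success. */
typedef int (*RasterizeFn)(unsigned char c, Glyph *out);

typedef struct {
    Glyph slots[CACHE_SLOTS];
    unsigned char valid[CACHE_SLOTS];
    RasterizeFn rasterize;
    unsigned long misses;    /* how many times the rasterizer ran */
} GlyphCache;

static void cache_init(GlyphCache *gc, RasterizeFn fn)
{
    assert(gc != NULL && fn != NULL);
    memset(gc, 0, sizeof *gc);
    gc->rasterize = fn;
}

/* Return the cached glyph, rasterizing only on the first request. */
static const Glyph *cache_get(GlyphCache *gc, unsigned char c)
{
    if (c >= CACHE_SLOTS)
        return NULL;
    if (!gc->valid[c]) {
        if (gc->rasterize(c, &gc->slots[c]) != 0)
            return NULL;
        gc->valid[c] = 1;
        gc->misses++;
    }
    return &gc->slots[c];
}

static void cache_free(GlyphCache *gc)
{
    int i;

    for (i = 0; i < CACHE_SLOTS; i++)
        if (gc->valid[i])
            free(gc->slots[i].mask);
    memset(gc->valid, 0, sizeof gc->valid);
}

/* Crude blit: every mask value > 0 becomes a solid pixel, with no
 * blending. Pixels falling outside the framebuffer are skipped. */
static void blit_mask(uint16_t *fb, int fb_w, int fb_h, int x, int y,
                      const Glyph *g, uint16_t color)
{
    int row, col;

    for (row = 0; row < g->height; row++) {
        for (col = 0; col < g->width; col++) {
            int px = x + col, py = y + row;
            if (px < 0 || py < 0 || px >= fb_w || py >= fb_h)
                continue;
            if (g->mask[row * g->width + col] > 0)
                fb[py * fb_w + px] = color;
        }
    }
}

/* Stand-in rasterizer: a fixed 2x2 mask with one empty pixel.
 * A real test would render the glyph with freetype here. */
static int demo_rasterize(unsigned char c, Glyph *out)
{
    static const uint8_t pattern[4] = { 255, 0, 128, 1 };

    (void)c;
    out->width = 2;
    out->height = 2;
    out->mask = malloc(sizeof pattern);
    if (out->mask == NULL)
        return -1;
    memcpy(out->mask, pattern, sizeof pattern);
    return 0;
}
```

With the default workload, a cache like this would run the rasterizer once per distinct character (26 times) rather than once per glyph drawn, which is consistent with the large gap between the cached and non-cached freetype results on ARM.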
Improvements and Future Directions

Some ideas for future tests:

- Investigate slow GTK blits more fully.
- Create a pango test with all lines in a single layout.
- Investigate slow glyph speeds on x86.