Friday, May 18, 2012

Developer's notes about OpenGL ES programming

 

Introduction


During development of our live wallpapers we have tried various ways of using OpenGL ES and have applied many optimizations to both the performance and the visuals of our live wallpapers. In this article I will try to describe the problems a developer can run into while developing for OpenGL ES on Android.


Early days. Using OpenGL ES 1.0

At first we used the jPCT-AE game engine to create 3D live wallpapers. The Clock, Christmas and HTC live wallpapers use this engine. It provides a good abstraction and an easy way to load models, but its pipeline was not flexible enough because it used OpenGL ES 1, which means no shader support. The latest versions of this engine support OpenGL ES 2.0, but all our recent live wallpapers use a custom framework anyway.
While jPCT-AE is a great engine, it had some limitations which we believe make it unsuitable for live wallpapers and which forced us to develop our own framework:
1. OpenGL ES 1 - say goodbye to shaders. As already mentioned, newer versions do support shaders.
2. Slow loading time - people reported that on low-end phones it took forever to start up the live wallpaper.
3. Buggy switching on/off - when the OpenGL context is re-created, some textures are lost. We were not able to find the cause of this.

Using OpenGL ES 2.0

What forced us to switch to a custom OpenGL ES 2.0 framework was the lack of programmability in OpenGL ES 1. For Rose Live Wallpaper we had to use alpha testing, and we couldn't implement it without shaders. We tried to use basic blending, but it required sorting of polygons, which was a weak side of jPCT-AE. Its sorting algorithm was simply not precise enough and the order of polygons was always broken.
To use OpenGL ES 2.0 we had to implement everything from scratch - initializing the OpenGL ES 2.0 context, loading textures and models, the drawing routine. We also had to learn what OpenGL shaders are. It is true that OpenGL ES 1 and 2 are two absolutely different things. Only about 1% of the old code was reused in the OpenGL ES 2.0 engine.
OpenGL ES shaders give you almost infinite possibilities to create any effect you see in modern PC engines. The latest mobile games (not only on Android, but on iOS too) that use shaders support dynamic shadows, reflective water, motion blur, depth of field, distortions and all the other stuff you can see on PC.

Tools of Trade

We use ATI’s RenderMonkey for shader programming. It has OpenGL ES 2.0 support and gives you an immediate preview of changes. It is useful both for coding shaders and for tweaking parameters to achieve the best look.
You can download RenderMonkey here, it’s free: http://developer.amd.com/ARCHIVE/GPU/RENDERMONKEY/Pages/default.aspx.
For GeForce video cards, you’ll need the ‘nvemulate’ tool to use RenderMonkey (you’ll need to do this only once, though). Launch it and change the ‘GLSL Compiler Device Support’ option to ‘NV40’. Download it here: http://developer.nvidia.com/nvemulate.

 

Fight for performance

Android phones have very limited GPUs. You can’t infinitely increase the polycount, texture sizes and complexity of shaders. It is very easy to hit the fill-rate limit. It is much harder to optimize an application correctly - to provide decent performance without reducing image quality.
We always test our live wallpapers on low-end devices. If a new app doesn’t run at an acceptable frame rate on a Nexus One, we can’t publish it to Google Play.

Compressed Textures

In short, always use compressed textures. The impact on performance is huge. In our apps, using compressed textures gives about a 30-50% increase in performance, and this is almost free: just compress your textures and rendering gets at least 30% faster.
In the modern world of mobile GPUs there are a lot of texture compression methods. PowerVR uses PVRTC (you can find a lot of information about this compression in articles about OpenGL for iOS). Qualcomm suggests using ATITC textures, and nVidia encourages us to use the well-known set of DXT compression algorithms. So which one should you use? The only texture compression format supported by all Android devices is ETC1. It is the only official OpenGL ES texture compression format which definitely works on any Android device that supports OpenGL ES 2.0. ETC1-compressed textures take 6 times less GPU memory than uncompressed 24-bit RGB textures, so far less data has to be fetched per pixel and the fill rate increases significantly. The major drawbacks of this compression are quality loss and the lack of an alpha channel. In some cases the quality loss can be quite noticeable. The lack of an alpha channel, however, is not a big deal; I will cover ways of bypassing this limitation later.
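To put the "6 times" figure in numbers: ETC1 stores every 4x4 pixel block in 8 bytes (4 bits per pixel), while uncompressed RGB needs 24 bits per pixel. Here is a minimal sketch of that arithmetic in plain Java; the class and method names are made up for illustration and are not part of our framework.

```java
/** Illustrative helper (hypothetical names): rough texture memory estimates. */
final class TextureSizes {
    /** Bytes for an ETC1 image: 8 bytes per 4x4 block, partial blocks rounded up. */
    static int etc1Bytes(int width, int height) {
        int blocksX = (width + 3) / 4;
        int blocksY = (height + 3) / 4;
        return blocksX * blocksY * 8;
    }

    /** Bytes for an uncompressed image; bytesPerPixel is 3 for RGB888, 4 for RGBA8888. */
    static int rawBytes(int width, int height, int bytesPerPixel) {
        return width * height * bytesPerPixel;
    }
}
```

For a 512x512 texture this gives 131072 bytes for ETC1 against 786432 bytes for 24-bit RGB. Mipmaps, if used, add to both numbers.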
DXT texture compression, supported by nVidia Tegra GPUs, is a block-based format similar to ETC1 and in our experience it doesn't provide good image quality either; this is noticeable on gradients. PVRTC achieves the best image quality with great compression, and it also supports an alpha channel.
Of course, if you can, you should always look for the best texture compression supported by the current device and use it, but for us ETC1 works quite well in 90% of cases. It gives a significant increase in performance, and the loss of quality is in most cases not very noticeable. Only in some cases do we use PVRTC compression on devices supporting it. For example, the sky texture in Wind Turbines looks too bad when compressed with ETC1 or DXT, because gradients are distorted too much. That's why we use an uncompressed texture on most devices and PVRTC on PowerVR chips.

Mipmaps

Mipmaps should be used when one texture has to cover a large area at a variable distance from the camera, such as terrain or long walls. If the position of an object is static, you should just pick the optimal texture detail and not waste GPU memory on storing mipmaps for this texture.
We used mipmapped ETC1 textures for the terrain in the Wind Turbines live wallpaper. The difference in image quality between the texture with mipmaps and without is absolutely unnoticeable, especially taking into account that distant parts are fully covered with fog. We use bilinear filtering - linear filtering with mipmaps for minification and linear filtering for magnification:

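// Minification: bilinear filtering within the nearest mipmap level
GLES20.glTexParameteri(GLES20.GL_TEXTURE_2D, GLES20.GL_TEXTURE_MIN_FILTER, GLES20.GL_LINEAR_MIPMAP_NEAREST);
// Magnification: plain bilinear filtering
GLES20.glTexParameteri(GLES20.GL_TEXTURE_2D, GLES20.GL_TEXTURE_MAG_FILTER, GLES20.GL_LINEAR);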

   
Interestingly, in this particular case mipmaps didn't bring any performance improvement on any of our test devices. Another interesting fact is that even though filtering is bilinear, on some GPUs partial dithering is applied between mipmaps of different levels. This gives slightly better image quality by making the transition between mipmaps less noticeable. We believe this dithering is caused by uneven terrain slopes. The GPUs which add this dithering are Tegra 2 and PowerVR. Qualcomm Adreno 205 doesn't add dither.
In any case, you should use mipmaps where they are needed. This is considered best practice and in general should improve performance.

Alpha Testing

For transparency in OpenGL ES you usually use blending. Blending with an alpha channel can be used for simple masked textures. The major problem with blending is that you have to draw polygons in a certain order to achieve the needed result. Sorting polygons is a set of complex and slow calculations. Additionally, you cannot keep triangles in a static VBO if you change their order by sorting them. To achieve order-independent masking, the best method is alpha testing. Alpha testing uses a very simple shader - you just put one condition in an 'if' statement and call 'discard;' when the pixel should be transparent. Alpha testing writes to the z-buffer, so geometry gets culled even by partially transparent polygons. Usual blending doesn't occlude geometry, and that's why it is slower than alpha testing.
The simplest way to make masked textures is to discard pixels with a certain color. For a faster shader, you can compare only one color channel.
A more complicated solution is to use two separate ETC1 textures for the diffuse and alpha channels. This uses less memory than an uncompressed texture with an alpha channel, and provides better performance because mobile GPUs can sample two textures in one pass. Masking by a certain color can lead to thin black edges on masked objects caused by texture filtering. Separate textures for the diffuse and alpha channels will not cause this - this is covered in more detail in the section “PNG Troubles”. This method is used in the Lantern Festival app for masked textures of vegetation.
Example 1 - alpha test by certain color in diffuse:
if (base.b > 0.5) { /* discard bluish colors */
  discard;
}


Example 2 - alpha test using separate mask texture:
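precision mediump float;
uniform sampler2D sTexture; /* diffuse texture */
uniform sampler2D sMask; /* black and white mask texture */
varying vec2 vTextureCoord;

void main() {
  vec4 base = texture2D(sTexture, vTextureCoord); /* sample diffuse */
  vec4 mask = texture2D(sMask, vTextureCoord); /* sample mask */
  /* Discard if mask is less than 50% white. For performance, we use only red channel */
  if (mask.r < 0.5) discard;
  gl_FragColor = base;
}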


Blending

Of course, for real transparency you will need blending. A good example of blending is Candle Live Wallpaper. All shadows use the blending mode (GL_ZERO, GL_SRC_COLOR) with black-on-white shadow textures. Because the alpha channel is not used for this blending, these textures can be easily compressed. And because shadows don’t have to be very detailed (some of them are 64x64 and 32x32), they take little memory and are fast.
The flame and the quill also use blending, and because these objects overlap they need to be sorted. A very basic algorithm is used to sort them - just checking the camera rotation angle.
You can pick the blending mode that suits your needs by looking at this sample image: http://soup.nuthatch.com/post/62627763/blending-jpg-800-600-pixels.
When rendering transparent polygons, writing to the depth buffer must be disabled. Remember that transparent polygons should never leave anything in the Z-buffer. For shadows in Candle Live Wallpaper, we disable both depth writing and the depth test to avoid Z-fighting artifacts. This is convenient for shadows on a flat surface, but for other transparent objects, like the quill and flame in our example, polygon sorting must be used. In this case, the depth test is enabled and depth writing is disabled. First, render all solid objects to fill the Z-buffer with information about scene depth, so that transparent polygons can read the Z-buffer and be occluded correctly. Then render transparent polygons in the correct order - the furthest first, the nearest last. For Candle Live Wallpaper we could easily determine this order from the camera rotation angle, but in other cases you might want to use more sophisticated polygon sorting.

Vertex Buffer Objects

In our experience, using VBOs gave no performance boost. If an application draws a large number of triangles, sending them to the GPU on each frame is very expensive. Our apps draw up to 3000 triangles each frame, so we expected that storing vertex data in VBOs would give a noticeable performance boost. That was the theory. In practice, in our case it gave no increase in performance. However, we always use VBOs to draw static objects in order to reduce bandwidth load, and only for animated objects do the apps transfer all vertex data to the GPU each frame.

Limitation of Two Texture Samplers

On most devices you will be limited to only two texture samplers at once. Some mobile GPUs do have more texture samplers, but even such GPUs as Tegra 2 have only two hardware samplers. If you use more than two samplers in a shader, rendering a single frame will take an additional pass, which significantly affects performance. To overcome this limitation you can put more information into a single texture.
Here are some examples. In Candle Live Wallpaper, when adding specular highlights to the candle, the shader uses the red and green channels of the specular map separately. The specular color is a constant defined in the shader.
Code excerpt:
 vec4 spec = texture2D(specularMap, vTexCoord);
 vec4 SpecularColor = spec.r * specular * Ks * pow( max( 0.0, dot(vReflect, vViewVec)), n_specular);
...
 baseColor += baseColor * abs(time_0_X) * spec.g;


Another example is the Wind Turbines live wallpaper, where the lightmap texture for the terrain contains two parts - a static lightmap and dynamic shadows cast by clouds. The shader picks 2 colors from different texture coordinates and then combines them with the diffuse color of the terrain.
Code excerpt:
vec4 shadow = texture2D(sLightmap, vTextureCoord_shadow); // static lightmap
vec4 base = texture2D(sTexture, vTextureCoord_base) * shadow * 2.0; // diffuse color
base = mix(cloudColor, base, texture2D(sLightmap, vTextureCoord_cloud)); // add moving shadow. Use sLightmap texture unit but with different texture coordinates
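A texture like the specular map above has to be prepared offline. The following is only a sketch of how two grayscale maps could be packed into the red and green channels of one image, written in plain Java over ARGB pixel arrays; the class and method names are hypothetical, and in practice you can do the same thing in any image editor.

```java
/** Illustrative helper (hypothetical names): packs two masks into one texture. */
final class ChannelPacker {
    /**
     * Packs two 8-bit grayscale maps into the red and green channels of ARGB pixels.
     * Blue is left at 0 and alpha is set to 255.
     */
    static int[] packRG(int[] grayForRed, int[] grayForGreen) {
        if (grayForRed.length != grayForGreen.length) {
            throw new IllegalArgumentException("maps must have the same size");
        }
        int[] out = new int[grayForRed.length];
        for (int i = 0; i < out.length; i++) {
            int r = grayForRed[i] & 0xFF;
            int g = grayForGreen[i] & 0xFF;
            out[i] = 0xFF000000 | (r << 16) | (g << 8);
        }
        return out;
    }
}
```

The shader then reads each map from its own channel (spec.r and spec.g in the excerpt above) with a single texture fetch.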

Resolution of FBOs

Many complex effects, like real-time reflective water or bloom, require rendering to texture. This means that you will need to render the scene twice - to the screen and to an off-screen framebuffer. The most popular optimization for rendering a water reflection to an off-screen framebuffer is simplifying the geometry. The Lantern Festival live wallpaper uses this technique - small unnoticeable details are not drawn in the water. All main objects in the scene, however, are present in the reflection. The other important part of improving off-screen rendering performance is obvious - use a framebuffer of lower resolution.
In order to speed up rendering to texture we also tried simpler, faster shaders, and skipped some objects which caused a significant FPS drop in the main rendering pass. The most interesting thing here is that simplified geometry, fast shaders without complex effects and low-res textures without filtering together gave a negligible performance increase of about 1 FPS. Reducing the resolution of the framebuffer object from 512x512 down to 256x256 and even 128x128, on the other hand, gives a significant performance boost. Of course, reflective water with a 128x128 texture doesn't look detailed enough, so we picked 256x256 as the optimal resolution, which provides good visual quality with acceptable performance.
Optimizing geometry has much less effect than simply reducing the resolution of the FBO texture. A smaller FBO means fewer pixels to rasterize (256x256 has four times fewer pixels than 512x512), and this significantly improves performance.
So our advice is to always pick the lowest possible resolution for render target textures.

Culling and Draw Order

This may sound too easy. Of course you use backface culling every time it is needed. And of course you draw transparent objects in the correct order. To explain how important the draw order of non-transparent geometry can be, I will give a real example.
There was a problem with the Lanterns wallpaper. It was almost done, but the framerate was below an acceptable 25-30 fps. We had added a shader to draw clouds and it caused a significant loss of FPS. But it looked so good that we couldn’t remove it; the sky wouldn’t be alive without it. And this shader was already highly optimized, so there was no way to make it faster without crippling it. In fact, drawing the sky was one of the slowest parts of the whole scene.
We managed to reach an acceptable framerate by changing the draw order. Initially the app rendered the whole sky sphere first and then the rest of the geometry. We realized that this caused the loss of performance. When we changed the draw order to draw the terrain and other stuff first, the sky got occluded by this geometry and the GPU had to rasterize less than half of the sky sphere's triangles. This immediately gave us acceptable performance.
Basically, always try to render opaque objects nearest first, furthest last.
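The ordering rules from this section and from the Blending section can be sketched together as follows. This is plain Java (it uses lambdas, so it assumes a Java 8 compatible toolchain) with a hypothetical Item class; our wallpapers use much simpler hard-coded ordering, so treat it as an illustration rather than our actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative helper (hypothetical names): decides the draw order for a frame. */
final class DrawOrder {
    /** A drawable with its distance to the camera and a transparency flag. */
    static final class Item {
        final String name;
        final float distance;
        final boolean transparent;

        Item(String name, float distance, boolean transparent) {
            this.name = name;
            this.distance = distance;
            this.transparent = transparent;
        }
    }

    /** Returns opaque items nearest-first, followed by transparent items furthest-first. */
    static List<Item> sort(List<Item> items) {
        List<Item> opaque = new ArrayList<>();
        List<Item> transparent = new ArrayList<>();
        for (Item item : items) {
            if (item.transparent) {
                transparent.add(item);
            } else {
                opaque.add(item);
            }
        }
        // Opaque: near objects fill the Z-buffer first and occlude far ones (like the sky).
        opaque.sort(Comparator.comparingDouble((Item i) -> i.distance));
        // Transparent: blending needs the furthest polygons drawn first.
        transparent.sort(Comparator.comparingDouble((Item i) -> i.distance).reversed());
        List<Item> result = new ArrayList<>(opaque);
        result.addAll(transparent);
        return result;
    }
}
```

With a sky at distance 100, terrain at 10, a quill at 7 and a flame at 5 (the last two transparent), the resulting order is terrain, sky, quill, flame.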

Balance Between Vertex and Fragment Shaders

To achieve the best performance, always split calculations effectively between vertex and fragment shaders. If some values can be calculated per vertex, there’s no need to compute them for each pixel in the fragment shader; do this math in the vertex shader.
Sometimes you’ll have to find a compromise between more precise results with a fragment shader and better performance with a vertex one. We learned a good lesson with the fog shaders in the Wind Turbines and Lantern Festival live wallpapers. Rewriting per-pixel fog as per-vertex fog in Wind Turbines gained us 5-7 FPS. As for Lantern Festival, the vertex shader allowed us to do a lot of calculations without any performance loss at all. In this case, fog is calculated not only from the distance, but also from the height above the ground. In addition, the fog color is attenuated by the light’s position - it is highlighted around the moon. This amount of calculation would effectively kill performance if it were done in the fragment shader.
If you decide to take the high road of vertex shaders, you must consider your geometry first. Data from the vertex shader is passed to the fragment shader linearly interpolated, and thus the vertices of your geometry can become evident. To avoid this, tessellate your models where needed. Use more polygons for better results. Objects like landscapes are usually evenly tessellated already and are good candidates for vertex shaders. As an example, we had to tessellate the water sheet in the Lantern Festival live wallpaper - without that, the fog vertex shader was not applicable to the whole water surface. The amount of tessellation was really small - the water sheet had only 40 triangles.
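To see why tessellation matters for per-vertex fog, here is a small numeric sketch in plain Java. It assumes a simple exponential fog formula and made-up distances and density (our real shaders also use height and light position), and compares fog computed exactly at every point of an edge with fog computed only at the vertices and linearly interpolated in between. All names are hypothetical.

```java
/** Illustrative helper (hypothetical names): per-vertex vs per-pixel fog error. */
final class FogInterpolation {
    /** Exponential fog amount in [0, 1); density is an assumed tuning value. */
    static double fog(double distance, double density) {
        return 1.0 - Math.exp(-density * distance);
    }

    /**
     * Largest difference between exact fog and fog interpolated linearly between
     * vertices, along an edge from distance d0 to d1 split into equal segments.
     */
    static double maxError(double d0, double d1, double density, int segments) {
        double worst = 0.0;
        double step = (d1 - d0) / segments;
        for (int s = 0; s < segments; s++) {
            double start = d0 + s * step;
            double fogStart = fog(start, density);
            double fogEnd = fog(start + step, density);
            for (int i = 0; i <= 16; i++) { // sample points inside the segment
                double t = i / 16.0;
                double exact = fog(start + t * step, density);
                double interpolated = fogStart + t * (fogEnd - fogStart);
                worst = Math.max(worst, Math.abs(exact - interpolated));
            }
        }
        return worst;
    }
}
```

For an edge spanning distances 0 to 100 with density 0.03, a single segment is off by roughly 0.3 (a third of the full fog range), while splitting it into 8 segments brings the error below 0.02.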

Complex Calculations in Shaders

Try to simplify the math in your shaders, especially in fragment shaders. For example, we heavily optimized the water shader in the Lantern Festival live wallpaper using the following methods:
1. Removed the clamp() call, which gave a few extra FPS. Instead we created a water texture with the needed clamping parameters.
2. Optimized math. Changing some formulas to simpler ones also gave some performance boost. We removed some unnecessary multiplications, which turned out to be the most critical operations in the fragment shader.
3. Moved the fog calculation to the vertex shader, and increased the tessellation of the water for a better look.
Basically, if the performance of a shader is not acceptable, you should experiment and strip suspicious parts one by one, narrowing the search down to the most critical part. The cause of low performance is not always obvious. Once the bottleneck is found, you can optimize only this part of the shader.

Other Useful Stuff

 

nVidia Coverage Sampling Antialiasing

nVidia’s Tegra 2 GPU supports coverage sampling antialiasing (CSAA). I had always thought that antialiasing on mobile devices is way too expensive and would cause a significant drop in FPS, but CSAA is fast. In our tests it didn’t noticeably affect performance at all. So I’ve added code to initialize CSAA in all our wallpapers. Please note that this is a proprietary feature of nVidia’s GPUs, so if you want antialiasing on devices with other GPUs you will need to fall back to initializing OpenGL ES with generic multisampling.
You can read about technical details of CSAA here: http://developer.nvidia.com/csaa-coverage-sampling-antialiasing
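A minimal sketch of the fallback idea, in plain Java without Android classes: build several EGL attribute lists and try them with eglChooseConfig in order of preference (CSAA, then generic multisampling, then no antialiasing). The numeric constants are the standard EGL values and the values defined by the EGL_NV_coverage_sample extension; the color and depth sizes, the sample counts and all class and method names are assumptions made for this example, not our actual initialization code.

```java
import java.util.Arrays;

/** Illustrative helper (hypothetical names): EGL attribute lists to try in order. */
final class AntialiasingConfigs {
    // Standard EGL attributes.
    static final int EGL_BLUE_SIZE = 0x3022;
    static final int EGL_GREEN_SIZE = 0x3023;
    static final int EGL_RED_SIZE = 0x3024;
    static final int EGL_DEPTH_SIZE = 0x3025;
    static final int EGL_SAMPLES = 0x3031;
    static final int EGL_SAMPLE_BUFFERS = 0x3032;
    static final int EGL_NONE = 0x3038;
    static final int EGL_RENDERABLE_TYPE = 0x3040;
    static final int EGL_OPENGL_ES2_BIT = 0x0004;
    // From the EGL_NV_coverage_sample extension (Tegra only).
    static final int EGL_COVERAGE_BUFFERS_NV = 0x30E0;
    static final int EGL_COVERAGE_SAMPLES_NV = 0x30E1;

    /** Attribute lists for eglChooseConfig, best option first. */
    static int[][] candidates() {
        int[] base = {
            EGL_RED_SIZE, 5, EGL_GREEN_SIZE, 6, EGL_BLUE_SIZE, 5,
            EGL_DEPTH_SIZE, 16,
            EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT
        };
        return new int[][] {
            with(base, EGL_COVERAGE_BUFFERS_NV, 1, EGL_COVERAGE_SAMPLES_NV, 2), // CSAA
            with(base, EGL_SAMPLE_BUFFERS, 1, EGL_SAMPLES, 2),                  // MSAA
            with(base)                                                          // none
        };
    }

    /** Appends extra attribute pairs and the EGL_NONE terminator to a base list. */
    static int[] with(int[] base, int... extra) {
        int[] out = Arrays.copyOf(base, base.length + extra.length + 1);
        System.arraycopy(extra, 0, out, base.length, extra.length);
        out[out.length - 1] = EGL_NONE;
        return out;
    }
}
```

In a GLSurfaceView.EGLConfigChooser you would pass each list to eglChooseConfig and keep the first one that returns at least one config. As far as we know, the extension also requires clearing the coverage buffer each frame; check nVidia's documentation linked above for details.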

 

PNG Troubles

 

Alpha channel

PNG images have an alpha channel, which can be used for both blending and alpha testing. However, there is a problem with the way Android works with bitmaps. If a given pixel of a bitmap is fully transparent, its color will always be read as black. This causes a visual artifact with alpha testing - a noticeable black edge on transparent objects, especially ones with bright colors.
To eliminate this bug we decided to use a separate texture for alpha testing. The shader takes the diffuse color from one texture and the alpha value from another. Both textures can be compressed using ETC1 (neither of them uses an alpha channel). Two compressed textures take less GPU memory, and performance is even slightly better. You can even reduce the size of either the alpha or the diffuse texture separately if needed.
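The split itself is a simple offline step. Below is a sketch in plain Java over non-premultiplied ARGB pixel arrays; the names are hypothetical, and it assumes the source pixels were decoded by something other than Android's Bitmap, so that transparent pixels still carry their real color.

```java
/** Illustrative helper (hypothetical names): splits ARGB into diffuse and mask images. */
final class AlphaSplitter {
    /** Returns the color part with alpha forced to 255 (ready for ETC1 compression). */
    static int[] color(int[] argb) {
        int[] out = new int[argb.length];
        for (int i = 0; i < argb.length; i++) {
            out[i] = argb[i] | 0xFF000000;
        }
        return out;
    }

    /** Returns a grayscale mask where R = G = B = alpha of the source pixel. */
    static int[] mask(int[] argb) {
        int[] out = new int[argb.length];
        for (int i = 0; i < argb.length; i++) {
            int a = argb[i] >>> 24;
            out[i] = 0xFF000000 | (a << 16) | (a << 8) | a;
        }
        return out;
    }
}
```

Both results are plain opaque images, so they can go through any ETC1 compression tool, and the shader from Example 2 in the Alpha Testing section can read the mask from the red channel.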

 

Premultiplied colors

Another problem with PNG files in Android is that Bitmap stores images loaded from PNG with colors premultiplied by the alpha value of each pixel. GLUtils.texImage2D() also uses RGB colors premultiplied by alpha, so you can't get the original colors this way. But in OpenGL you may often want to use the alpha channel to store additional information. For example, we store a specular map in the alpha channel of a normal map. This is very handy when you are limited to only 2 texture samplers. Of course, in this case we need the original RGB colors, not premultiplied ones.
In order to load PNG images without the RGB channels being premultiplied, we use the 3rd party PNGDecoder and load the texture with glTexImage2D(). You can get the PNGDecoder library here: http://twl.l33tlabs.org/#downloads.
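To illustrate what premultiplication does to such data, here is a small sketch in plain Java. The exact rounding Android uses may differ; this is only an approximation with hypothetical names, meant to show the effect on a normal map that keeps a specular value in its alpha channel.

```java
/** Illustrative helper (hypothetical names): approximate alpha premultiplication. */
final class Premultiply {
    /** Multiplies each color channel of an ARGB pixel by alpha / 255. */
    static int premultiply(int argb) {
        int a = argb >>> 24;
        int r = ((argb >> 16) & 0xFF) * a / 255;
        int g = ((argb >> 8) & 0xFF) * a / 255;
        int b = (argb & 0xFF) * a / 255;
        return (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

A normal (0xFF, 0x80, 0x40) stored with a specular value of 0 in alpha comes out as 0x00000000, so the normal is lost completely, and with a specular value of 128 every component is roughly halved.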

Problems Caused by Differences Between GPUs

There is a vast variety of mobile GPUs. We try not to use proprietary features of any GPU vendor. We use only standard commands in shaders and ETC1 texture compression supported by all devices, and we don’t utilize any vendor-specific OpenGL ES extensions. We don’t require specific OpenGL configurations, and always try to pick one allowed for the given device. The only proprietary feature we use is initialization of the OpenGL context with nVidia’s CSAA, and only on devices with Tegra GPUs.
However, this doesn’t mean that there are no problems caused by differences between GPUs. We have to test apps on all mobile GPUs available on the market: Adreno, PowerVR, Mali-400, Tegra. And every time we launch an app on a new GPU we expect some surprising behavior. Here are some examples of the troubles you may encounter working with different GPUs:
1. Always set up the texture unit for each sampler, and use correct uniform attributes. Don’t rely on default values.
2. Shader commands. For example, it was very tempting to use the texture2DLod command, but some GPUs don’t support it.
3. Shader logic. For example, Adreno GPUs crash if you exit a shader by calling discard; before sampling all samplers. Strange as it is, we had to change the alpha test shader to sample all textures first and only then discard depending on the value of the alpha channel.
4. Very large values of texture coordinates passed to a shader cause some distortions, and textures may appear without filtering.

Other Tweaks

 

Memory allocations of Matrix.rotateM() and Matrix.invertM()

In OpenGL you manipulate matrices very often. Don’t call Matrix.rotateM() or Matrix.invertM() on each frame - these methods make unnecessary memory allocations. It is better to write a method in your renderer class which does the same math without allocating extra memory. This will prevent lag caused by Dalvik garbage collection.
For example, if Matrix.rotateM() is used a few times on each frame, on an HTC Desire with Android 2.2 the GC causes 100 ms lockups every 5-10 seconds. This is really noticeable. Android 2.3 has concurrent garbage collection, which reduces lockups to an unnoticeable 1-5 ms, but you still don't need to allocate that much memory just to rotate a matrix.
You can read more about this issue (and find sample code) here: http://groups.google.com/group/android-developers/browse_thread/thread/b30dd2a437cfb076?pli=1
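As a rough idea of what such a method can look like, here is a sketch in plain Java that rotates a column-major 4x4 matrix in place using two scratch arrays allocated once. It is not a copy of the code from the linked thread or of our renderer; the class and method names are made up, and it assumes the rotation axis is not zero-length.

```java
/** Illustrative helper (hypothetical names): rotation without per-frame allocations. */
final class MatrixRotator {
    private final float[] rotation = new float[16]; // scratch, allocated once
    private final float[] result = new float[16];   // scratch, allocated once

    /** Multiplies column-major matrix m by a rotation of angleDeg around axis (x, y, z). */
    void rotate(float[] m, float angleDeg, float x, float y, float z) {
        setRotation(rotation, angleDeg, x, y, z);
        multiply(result, m, rotation);
        System.arraycopy(result, 0, m, 0, 16);
    }

    /** Fills r with a rotation matrix; the axis is normalized here. */
    static void setRotation(float[] r, float angleDeg, float x, float y, float z) {
        float len = (float) Math.sqrt(x * x + y * y + z * z);
        x /= len; y /= len; z /= len;
        double rad = Math.toRadians(angleDeg);
        float s = (float) Math.sin(rad);
        float c = (float) Math.cos(rad);
        float nc = 1f - c;
        r[0] = x * x * nc + c;      r[4] = x * y * nc - z * s;  r[8]  = x * z * nc + y * s;  r[12] = 0f;
        r[1] = x * y * nc + z * s;  r[5] = y * y * nc + c;      r[9]  = y * z * nc - x * s;  r[13] = 0f;
        r[2] = x * z * nc - y * s;  r[6] = y * z * nc + x * s;  r[10] = z * z * nc + c;      r[14] = 0f;
        r[3] = 0f;                  r[7] = 0f;                  r[11] = 0f;                  r[15] = 1f;
    }

    /** out = a * b for column-major 4x4 matrices; out must not alias a or b. */
    static void multiply(float[] out, float[] a, float[] b) {
        for (int col = 0; col < 4; col++) {
            for (int row = 0; row < 4; row++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += a[k * 4 + row] * b[col * 4 + k];
                }
                out[col * 4 + row] = sum;
            }
        }
    }
}
```

Keep one MatrixRotator instance in the renderer and reuse it every frame. It is not thread-safe because of the shared scratch arrays, which is fine when all matrix math happens on the GL thread.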

 

Loading Geometry

In early versions of our live wallpapers we stored models in OBJ format. It was OK for reading small models, but for models with about 1000 polys it became too slow, especially on low-end devices. There were also complaints from users about huge loading times. That forced us to store model geometry in a binary format which can be passed to the GPU directly. We made a command-line tool which generates this binary data. We can generate data for models with or without normals, UV channels, and even tangents/binormals, depending on the shaders’ needs. For normal mapping, we do not calculate tangents and binormals with our own tools. Instead, we use Autodesk’s great FBX importer for 3ds max (Maya’s one can do this, too) - it has an option for generating them.
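A sketch of the loading side in plain Java, under the assumption that the file is just a raw array of little-endian 32-bit floats whose count is known in advance (our actual file layout is not described here, and the names are hypothetical). The data ends up in a direct buffer in native byte order, which is what glVertexAttribPointer() and glBufferData() expect.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Illustrative helper (hypothetical names): loads raw vertex data without parsing text. */
final class BinaryModelLoader {
    /** Reads floatCount little-endian floats into a direct, native-order FloatBuffer. */
    static FloatBuffer load(InputStream in, int floatCount) throws IOException {
        byte[] raw = new byte[floatCount * 4];
        new DataInputStream(in).readFully(raw);
        // View the file bytes as little-endian floats (assumed file format).
        FloatBuffer source = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer();
        // OpenGL needs a direct buffer in the platform's native byte order.
        FloatBuffer vertices = ByteBuffer.allocateDirect(raw.length)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        vertices.put(source);
        vertices.position(0);
        return vertices;
    }
}
```

There is no per-vertex string parsing as with OBJ, only one bulk read and one bulk copy, which is where the loading time is saved.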

 

Animation

We have implemented very basic animation. It is vertex keyframe animation with linear interpolation between keyframes. No shaders are used for animation; it is calculated completely on the CPU. As long as there are not too many keyframes, it doesn’t use much memory and works fast enough.
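The interpolation step can be sketched like this in plain Java. The data layout (one float array of vertex coordinates per keyframe, time measured in keyframe units) and the names are assumptions made for the example.

```java
/** Illustrative helper (hypothetical names): linear vertex keyframe interpolation. */
final class KeyframeAnimation {
    /**
     * Writes interpolated vertex coordinates into out (pre-allocated by the caller).
     * frames[k] holds all coordinates of keyframe k; time 1.5 means halfway
     * between keyframes 1 and 2. Time is clamped to the valid range.
     */
    static void interpolate(float[][] frames, float time, float[] out) {
        int last = frames.length - 1;
        float clamped = Math.max(0f, Math.min(time, (float) last));
        int k0 = (int) clamped;            // keyframe before the current time
        int k1 = Math.min(k0 + 1, last);   // keyframe after it
        float t = clamped - k0;            // blend factor in [0, 1)
        float[] a = frames[k0];
        float[] b = frames[k1];
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + (b[i] - a[i]) * t;
        }
    }
}
```

The output array is reused every frame, so the animation itself causes no garbage collection, and the result can be copied straight into the vertex buffer that is sent to the GPU.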

 

Object picking

For picking objects in 3D space we have implemented the ray picking method. It is fully based on this great article: http://android-raypick.blogspot.com/2012/04/first-i-want-to-state-this-is-my-first.html. It detects the collision of a ray cast from the view position through the given screen coordinates, translated into 3D world coordinates. The only modifications I've made to the code were to get rid of excessive memory allocations. Once refactored to use pre-allocated arrays for calculations, it works like a charm.
One note regarding performance of object picking: use very low-poly collision models instead of checking ray intersection with every triangle of the 3D objects. We haven't experienced any performance issues finding intersections with models made of 900+ triangles, but it is really unnecessary to have such detailed collision models for objects which you pick with a finger touch on the phone's screen.
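The core of such a test is a ray-triangle intersection. The sketch below uses the well-known Möller-Trumbore algorithm with pre-allocated scratch arrays; it is written from scratch in plain Java for illustration, is not the code from the linked article, and all names are hypothetical.

```java
/** Illustrative helper (hypothetical names): allocation-free ray-triangle test. */
final class RayPicker {
    private final float[] e1 = new float[3];
    private final float[] e2 = new float[3];
    private final float[] p = new float[3];
    private final float[] s = new float[3];
    private final float[] q = new float[3];

    /**
     * Returns the distance along the ray (in units of dir's length) to the triangle
     * stored as 9 floats at tri[offset], or -1 if the ray misses it.
     */
    float intersect(float[] origin, float[] dir, float[] tri, int offset) {
        for (int i = 0; i < 3; i++) {
            e1[i] = tri[offset + 3 + i] - tri[offset + i];
            e2[i] = tri[offset + 6 + i] - tri[offset + i];
            s[i] = origin[i] - tri[offset + i];
        }
        cross(p, dir, e2);
        float det = dot(e1, p);
        if (Math.abs(det) < 1e-7f) return -1f; // ray is parallel to the triangle
        float inv = 1f / det;
        float u = dot(s, p) * inv;
        if (u < 0f || u > 1f) return -1f;
        cross(q, s, e1);
        float v = dot(dir, q) * inv;
        if (v < 0f || u + v > 1f) return -1f;
        float t = dot(e2, q) * inv;
        return t >= 0f ? t : -1f;              // ignore hits behind the origin
    }

    private static float dot(float[] a, float[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    private static void cross(float[] out, float[] a, float[] b) {
        out[0] = a[1] * b[2] - a[2] * b[1];
        out[1] = a[2] * b[0] - a[0] * b[2];
        out[2] = a[0] * b[1] - a[1] * b[0];
    }
}
```

To pick an object, call intersect() for every triangle of its low-poly collision model and keep the smallest non-negative distance; the object with the nearest hit is the one under the finger.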

Conclusion

Hopefully, these notes about the problems we have encountered during development will help at least a few other developers make better apps and avoid our mistakes. It is always better to learn from other people's mistakes than to submit an app to Google Play and receive negative comments about incompatibilities or bugs on various devices.


Downloads

You can download the shaders used in our live wallpapers here: http://dl.dropbox.com/u/20585920/shaders.zip. Feel free to adapt and use them for your needs.
All these shaders are for the RenderMonkey application, which you can get from the AMD site (see the link in the Tools of Trade section).

15 comments:

  1. Great post, one of the best I've read for a while. Thanks for that :)

    One small question about using only two texture samplers. You mention that using more than two samplers is not encouraged because it will deny single pass rendering (due to hardware limitation). However, in the example code of wind Turbines you are using two samplers, but sampling them three times total. Isn't this method also forcing the code run two passes, or are you actually able to sample more if the number of samplers is limited to two?

    Thanks again!

    Replies
    1. Thank you for good comment. We also use the same method in water shader - only 2 samples but one of them sampled twice. It works noticeably slower than sampling it once but shader with 3 texture samplers is even slower. In case of wind turbines terrain shader performance drop of sampling one sampler 2 times is quite small and acceptable. So this is a good compromise between performance and visual quality of shader.

  2. This insights are very valuable! Thank you very much :)

  3. Great article... I downloaded Lantern Festival for some reference and it looks beatiful on my Nexus 7. I really benefited from the the points made here. Thanks for taking the time to share... Hope to hear more!

  4. Good article, although transparent polygons can and should be rendered with GL_DEPTH_TEST on, it's depth WRITING you want to turn off, not testing...

    Replies
    1. Thank you for useful comment. You are absolutely right, I've updated article. We disable depth testing only to draw shadows, because otherwise z-fighting occurs.

  5. Really useful and well-founded information. Thank you!

  6. Wonderful post!

    I'm only beginning developing opengl live wallpapers and have an additional question:

    You decided to use 2 etc1 textures for images with alpha channel, but why didn't you use other texture compression formats which support alpha?

    I know that etc1 is most common, but still it seems possible to create several apk for each texture format and serve them using Google Play's multiple apk feature.

    Do you find using 2 etc1's just being simpler compared to all this multiple apk handling or are there any other potential problems with other compressions I should be aware of? :)

    Replies
    1. We use ETC1 with external alpha because this texture format is supported by all Android devices with OpenGL ES 2.0. We do use other texture formats in cases where image quality is important, but all of these formats depend on GPU. For example, PVRTC (which we also use) looks almost as good as uncompressed texture, while DXT looks as bad as ETC1 only with alpha channel. So the main drawback of other texture formats is image quality - each of these compressions adds specific artifacts, so you should try it first and decide whether quality is good or not. Performance of all of these texture compressions is identical. We've tried a lot of different options and ETC1 (with separate ETC1 alpha where necessary) seems to be the best choice for 99% of cases.

    2. Great! Thank you for clarification!
      Very valuable info for the beginner! :)

  7. In my Nvidia GeForce GTX 560 Ti with latest drivers RenderMonkey didn't allow work with OpenGL ES shaders but OpenGL (Desktop) works fine. I also tried nvemulate tool but it also didn't help :(
    Any suggestions?

    Replies
    1. Sorry, can't solve your problems with RenderMonkey. In our case it works OK. I can only suggest to play with nvemulate options.

    2. Have tried Rendermonkey on new Fujitsu T901. Unfortunately, OpenGL ES emulation doesn't work on this hardware so have to work in OpenGL mode. Now trying PowerVR SDK instead.
