Saturday, February 21, 2009

Blending Vertex Normals In A Quad Tree

Ogre 3D Vertex Normal Blending from petrocket on Vimeo.

I've finally had some time to work on blending the mesh normals and can post a video - my FIRST video of the app so go easy - and talk about what works about it and where the idea came from.

As it turns out there is/was another excellent planetary LOD program based on geo clipmaps and I read that Lutz Justen (the author, now an engineer at Red 5 Studios!) had blended his normal maps and done some other fancy things like cloud map shadows. I contacted Lutz and he responded with this:

"Vertex normals: I don't think there is *any* way to make vertex normals not pop - UNLESS you change the way geo morphing works. You probably interpolate a vertex to the position between the two higher-level verts, i.e.
/ | \
/ \/ \

The upper vertex is interpolated to the average of the lower 2 vertices. In that case, you'll always have lighting pops. If you instead interpolate the upper vertex to either the left or the right vertex, you'd get rid of the pops because the limit of the finer triangles as you interpolate to the coarser triangles actually match the coarser triangles. I hope that makes sense. If you interpolate to the average of the lower vertices, the limit doesn't match the coarser representation. It's the same geometry, but a different triangulation.

If you store per-patch normals as Ysaneya does, there's another way. You can store your normal maps in object space (instead of tangent space) because they are unique anyway. Then you can lerp between a fine-level normal map and the corresponding coarse-level normal map. That'd make your lighting completely smooth. That's what I did in my demo, although the resolution of the normal map was very low - it was the same resolution as the vertices! But nevertheless, the lighting was completely continuous."

And that's what I tried to do, only right now my normal map calculations for the "coarse" map are not entirely correct because there is still a pop when the mesh level changes - although a MUCH smaller pop than before, it is still there as you can see in the video. Maybe sometime I will post a video of the transitions without normal blending so you can see the improvement.

Here's how I implemented the normal map blending in Ogre (cooking directions style!)
6 quad trees
4 quad tree meshes per node
2 sets of normals per mesh (for blending)
1 blend start distance for each level/depth in the quad tree
1 blend end distance for each level/depth in the quad tree
1 GLSL vertex shader to take these values and do something with them

1. prepare your meshes by creating the 3d vertex positions from your 3d noise function
2. calculate your end normals based on those positions and save them - these are the normals we will blend to.
3. get the outer edge vertices for the "coarser" mesh
4. calculate the normals for each vertex that exists in the "coarser" mesh
5. calculate all the interpolated normals for each vertex that exists in the "finer" mesh but not in the "coarser" mesh.
6. in the vertex shader blend between the "finer" and "coarser" normals based on the camera's distance to the vertex. Each level of the quad tree will have a different distance based on the split distance for that level - the split distance is the distance from the camera to a node at which it must split into 4 sub nodes, one for each quadrant.

Right now my calculations are not completely correct for step 5 because there is still a visible transition or "pop" when the finer mesh is displayed. I don't know where my calculations are wrong, but if I figure it out I will post the solution.

Saturday, February 14, 2009

Multithreaded Graphics Programming Caveats

You can't make graphics calls in any thread but the main thread without sacrificing platform portability and graphics library portability. In my case I want to be able to run this application on Windows in DirectX 9+ and in OSX and Linux in OpenGL. OpenGL allows you to make graphics calls in a secondary thread if you copy and initialize the graphics context in that other thread, but even if I did that I would have to write separate procedures to initialize the secondary threads for OpenGL and DirectX and none of it would be pretty and from what I've read the gains wouldn't be that significant.

So my compromise is to use POSIX threads and OpenMP for the non-graphic call related stuff - like the heavy lifting math in the terrain generation and normal maps etc. All the graphics related calls will live in the main thread (creating vertex/index buffers and populating them with data, and releasing vertex/index buffers).

I've noticed that creating an Ogre VBO and populating it with data can take up to 700 microseconds (but averages around 165 microseconds), while deleting an Ogre VBO takes 10 times as long (max 8 milliseconds, avg 1 millisecond). By only creating or deleting one VBO per frame, and after moving all the vertex calculations to separate threads, I'm able to keep things pretty smooth. Did I mention I'm using my MacBook Pro? It has an Intel Core II Duo 2.4ghz with 4GB of RAM and the NVidia GeForce 8600M GT with 256MB VRAM.

At some point I will need to identify why freeing the buffer is taking so much more time, and try to optimize that.

Here's the biggest caveat with threads that I've run into: I cannot reliably allocate memory inside a child thread. Ogre uses nedmalloc now and it is a fantastic allocator/deallocator. Unfortunately when I try to allocate memory from within a thread other than the main thread I get an EXC_BAD_ACCESS signal from within nedalloc::GetThreadCache() The annoying thing is that it doesn't happen right away - it's only after the program generates enough meshes. So for now I'm steering clear of any allocation/deallocation inside threads and keeping that in the main thread. It doesn't look like keeping it in the main thread is causing any performance hit anyways.

Thursday, February 12, 2009

POSIX Threads And OpenMP Can Co-Exist

Not exactly a major revelation, but as an experiment I changed the mesh build process to a threaded model using POSIX threads (pthreads). The short version is that when the quad tree changes, first a manager thread is spawned, which spawns several worker threads. The worker threads build the actual mesh and then quit when there are no more meshes to build. The manager thread just waits for all the worker threads to finish then updates the render queue with the new list of meshes and quits.

One important thing to mention, besides all the mutexes to protect the lists, is that the quad tree is not allowed to be updated while the new meshes are being built. Changing the quad tree while still building would invalidate our list of meshes to build and could get us into a situation where the render queue is never updated because the list of meshes to build is never completed!

After the meshes are all built and the new visible list of meshes is given to the render queue then the quad tree is allowed to update, and when the tree changes the whole process begins again.

Also note that while the worker threads are building the new mesh, the render queue continues to display the meshes from the previous version of the quad tree.

The second part of the experiment was to use OpenMP to utilize the extra core of my processor in building the meshes using work sharing. Fortunately, XCode supports GCC 4.2 which supports OpenMP 2.5, and that will suffice for now. I only added parallel processing to the for loop that generates the vertices so I don't expect to see much improvement, however when I start generating normal maps I think the improvement will be noticable. Anyways, it compiles and runs.

Next I have to fix some logic bugs, re-implement the code that deletes old meshes in the cache list, and either re-implement the heightfield and normal generating, or take a crack at a second quad tree that is updated based on where the camera will be so those meshes can be pre-built and put in the cache.

A small aside - making the build process threaded made my simple profiler not completely reliable anymore because the thread may start sampling during one frame and then before it is done sampling, the profiler might attempt to process the samples because that part of the profiler runs in the main thread! Fortunately that is unlikely to occur and if it does it shouldn't skew the results much, though I suppose I could add a semaphore to resolve the issue.

Tuesday, February 3, 2009

Quadtree Mesh Management And Caching

Sometimes you can get away with half decent code when dealing with 2d terrain, but you will be punished if you try that with spherical/planet terrain.

Now that my feet are wet I've returned to the basics in order to deal with some of core issues like implementing a mesh manager to handle meshes should be visible, those that need to be built, and those are built that should be cached. I also created a customizable QuadTreeSphere class that I plan on using for the water mesh, the atmosphere and the planet surface. I now can specify the min and max node levels allowed in the quadtree, the patch size for each mesh, the max cache size for each mesh and things like whether the mesh will use normals or textures or tangents or even a heightmap.

The new system doesn't do anything but basic terrain now because I've been focusing on the mesh management and caching.

Mesh Management

I've implemented a separate manager for each QuadTreeSphere that has four std::map objects that hold pointers to the mesh objects and the key for each element in the map is the unique ID for that mesh.
1. A visible map that holds all the current visible meshes.
2. A visible build map that holds all the unbuilt meshes that are needed for immediate display.
3. A cache map that holds all the non-visible, but built meshes that we might need soon.
4. A cache build map that holds all the unbuilt meshes that we might need soon.

I also have a FIFO (first in first out) list that contains all the id's of all the meshes in the cache map. If the FIFO list ever gets larger than the max cache size then I start removing meshes from the list - and from the cache and deleting them. This list is used to delete the least recently used meshes. Of course if any mesh in the cache is moved to the visible map then it must also be removed from the FIFO list, and when that mesh is moved from the visible map to the cache map it will go on the front of the list.

Some other rules I have found that I needed to implement:
1. All meshes in the visible build map must be built immediately or you get flickering where you have not built a mesh (duh)
2. Only build one mesh from the cache build map per frame and only if no meshes in the visible build map were built because building meshes takes a lot of time (relatively).

I added a simple profiling class because I know I will need to compare build optimizations and also I needed to get a good idea for how much time each major component takes. Clicking on the chart below takes you to the larger version.

Profile chart

Quadtree Pre-Caching

My simple manager has a bottleneck right now and that is when the camera gets close to the surface all these new meshes are added to the visible build list and must be immediately built for immediate display. There are a couple things you can do to resolve this issue and the simplest one is to do pre-caching. My simple method is to take the closest quadrant from the closet node at the current deepest level of the tree and pre-cache the four meshes for that quadrant (which we will have to split anyways if the camera get's closer to the surface). The second simple thing to do is to pre-cache the meshes for the neighbors of the closest node at the current deepest level of the quadtree (if those neighbor nodes don't already exist).

Below is a diagram of this:
Quadtree Pre-Caching Diagram

The Pink square represents the closest node to the camera at the current deepest level in the quadtree. The four orange squares are the neighbors of this node.
The purple square represents the quadrant that is closest to the camera in the pink square.

Under my simple caching scheme we know that if the camera moves closer to the surface we must split the purple quadrant - so the first thing to do is to build those meshes and save them in the cache until they're needed. If the camera moves left or right along the surface but not closer to the surface then we'll need the meshes represented by the orange squares, so build those and add them to the cache.

If the camera moves further away from the surface the larger meshes should still be in the cache because every mesh that was visible is added to the cache after it is not needed.

Pros of this scheme:
1. Simple and works without modifying the quadtree
2. This caching scheme works fine when the camera is tracking a player running on the surface, where they can change direction and speed rapidly and frequently.

Cons of this scheme:
1. If the camera moves in a diagonal fashion along the surface, the diagonal nodes will not have their meshes in the cache (green nodes)
2. This scheme is ignorant of the camera's current speed and direction, which is more useful information when dealing with fast moving objects above the surface, where the direction and speed won't change as rapidly.

I plan on implementing another cache scheme based on speed and direction where I have a second quadtree that represents where the camera will be in the next nth of a second, and based on that tree, cache all the meshes that would be visible. According to my profiling statistics, the updating of the quadtree doesn't take nearly as much time as building meshes, so it might be worth the extra time to implement this pre-cache scheme and get better cache results.

Lastly, I implemented frustum culling for my barebones quadtree system and it helps the FPS a lot! It cut down the number of triangles from about 200k to about 90k.

Frustum culling

I plan on doing some experiments with making the mesh manager threaded to better utilize the cpu.