Tuesday, March 24, 2009

Resource Managers

I've been looking for a fast, safe, memory efficient implementation of a basic resource manager that does a few things:

  1. Uses handles instead of pointers

  2. Supports reference counting so resources can be shared and freed properly

  3. Doesn't fragment the heap, but instead, allocates memory in contiguous chunks so all the data in a resource "pool" is local

  4. Is templated and simple

  5. Releases resource handles by pushing them onto the back of the list and getting available handles off the front of the list so that the most recently used handles stay in the handles list longest for best cache results.


Here are the ones I considered in my research.

A Simple Fast Resource Manager using C++ and STL (Ashic Mahtab, Zinat Wali)

Pros:
Uses STL and templates, very simple easy to understand, uses reference counting and handles.

Cons:
Uses strings for unique identifiers, pointers to resources are stored in std::vector, not the resources themselves, uses a stack for handles (push and pops off the front).

Gem: A Generic Handle-Based Resource Manager (Scott Bilas)

Pros:
Uses STL and templates, simple to understand, uses handles, stores resources not resource pointers.

Cons:
Uses strings for unique identifiers, does not have reference counting for handles (handle sharing), does not use a queue for handles.

Ogre 3D Resource Manager

Pros:
Is built to work with Ogre resources, supports shared resources and handles (reference counting), supports loading states and threaded loading of resources.

Cons:
Uses strings for unique identifiers, does not use a queue for handles, stores pointers to resources.

All these resource managers are great for managing resources loaded from files, which is why they all use a string as an identifier. None of them were designed with storing on-the-fly generated terrain data in mind.

I ended up starting with Ashic Mahtab and Zinat Wali's manager code and modified it to use a std::deque for the handles, to use an unsigned long long as the unique identifier and to store the resources themselves and not resource pointers.

So after I coded that up I ran into some major problems:

  1. The Ogre::VertexData class has its constructor set private so I couldn't subclass it to easily make it a resource. The solution was to store a shared pointer to it instead, which means that I lose my resource memory locality.

  2. My vertices and blending vertices can be of variable size and my implementation was designed to store objects of known sizes (known at compile time). I could specialize the vertices manager class to handle this shortcoming, but I opted to just use a shared pointer instead.


Looks like I ended up basically storing pointers to resources instead of the resources so far!

The profile results show that I have brought the deletion time down from 251 microseconds average to about 5 microseconds average. Over-all the application update time and the _updateRenderQueue() function time seems to have decreased by at least half.

So I'm fairly happy with the results, the problem is that I'm still seeing spikes in the Ogre::Root::renderOneFrame(). The average time for that function is 4,000 microseconds, but the maximum time is around 40,000 microseconds.

In the future I will look into what is going on inside Ogre that is taking this time, but for now I'm going to continue on with my refactoring so this may be the last resource performance related post for a bit.

Friday, March 6, 2009

Shared Vertex Buffers Cont.

The plot doesn't thicken, it plods on like man in a desert holding a can of pudding and desperately looking for a opener. Actually, it isn't like that at all, but when it gets late you start to get more "poetic" and silly.

profile graph
Profile Data On Swivel! The blue line is the time taken by renderOneFrame() (an Ogre3D function). The yellow line is the time the application takes to update the quad tree and the brown is the amount of time within renderOneFrame() that is taken up by my application updating the render queue and creating/updating vertex/index buffers. The vertical axis is time in milliseconds and the horizontal axis is time in seconds. This chart only shows the main thread and does not include the builder threads that create the mesh data including the heightfields and normal maps.

I uploaded some profile data for the test application that runs in 3 different modes to Swivel, which is a free online data/graphing tool that is in "preview" aka beta. I needed to find a better graphing tool than iWork's Numbers, which takes 10+ minutes to make graphs of data with more than a handful of rows and columns ugh.

I've created a class that encapsulates all the functionality for managing the vertex buffers for the terrain. The class has 3 managing "modes". It can have all the meshes use a large single vertex buffer, or multiple vertex buffers. When using multiple vertex buffers it can either cache those buffers for later re-use, or immediately delete buffers that are no longer used.

When running in single buffer mode, or multiple shared buffer mode, every mesh that is in the visible list and the visible build list has space in the buffer. Every mesh in the cache list no longer has a guaranteed space in the buffer - however if the cached mesh is still around when we need it later the vertex buffer manager will check to see if that mesh's data is still in the vertex buffer somewhere.

The nice thing about sharing the vertex buffers is that it means I'm not deleting them like crazy, and because I create enough space to hold a good number of meshes when the program loads I'm not creating buffers like crazy (or expanding them in the case of the single buffer).

Conclusions:
Now that I have coded these 2/3 methods up, I need to do some further testing because so far the only definitive results are that all 3 versions exhibit random spikes in renderOneFrame() that I think are just related to panning the camera around which updates the frustum culling, and that the FPS drops from 310 to 290 when switching from STATIC to DYNAMIC vertex buffers. I plan on making the camera follow a specific path and output profile data for each method using that path so I can better compare the results. Right now it's just me flying down to the surface and then randomly visiting spots on the surface.

Wednesday, March 4, 2009

Shared Vertex Buffer vs. Multiple Vertex Buffers

I went ahead and implemented a shared vertex buffer for all the terrain patches and did some profiling and found that updating the shared vertex buffer takes 6 times longer than creating a new buffer and filling it. Also, when using a shared vertex buffer the fps drops from about 240 to 200 - I'm guessing this has to do with using a dynamic versus a static_write_only buffer. I tried setting the shared buffer type to HBU_DYNAMIC and HBU_DYNAMIC_WRITE_ONLY and used HBL_NO_OVERWRITE for my lock type when writing vertex data. I had 16 unique index buffers (for the 16 different index orders) and each patch used one of those 16 buffers so those didn't change - I just had to update the vertex buffer every time I wanted to add a new patch.

So, say for example that my program samples the longest amount of time it takes to create a vertex buffer and fill it - that took on average about 1 ms in my tests, then the time it took to lock part of an existing buffer (the shared buffer) and update it - that took on average 6 ms.

The funny thing about it all is that when NOT using the shared vertex buffer the program seems to "stutter" more - and I haven't nailed down why it does that, but my current theory is some kind of buffer management overhead in Ogre when I created/delete all these buffers, because I've profiled most of my code. When I use the shared buffer the program seems to run smoother with less stuttering, but the fps drops. The stuttering only occurs when flying over the terrain and it is changing LOD often.

My (final?) idea is going to be to go with creating vertex buffers per patch - however, instead of deleting vertex buffers when a patch gets deleted, I'm going to try and recycle that buffer by giving it to another patch to use. My hope is that using HBL_DISCARD on that existing buffer will be faster than creating a new buffer for the new patch and filling it - and I know I will gain however much time it takes to delete a vertex buffer.

Not that the program stuttering is bad - it is MUCH improved since moving the build process to a separate thread, and since throttling the number of buffers created/deleted/filled per frame.

*EDIT*

Just did some quick profiling test and can confirm that when using the single shared vertex buffer the time taken by _renderOneFrame() is about 6ms on average - 12ms worst case, and most of that time is reflected in the updating of the shared vertex buffer. However _renderOneFrame() averages 12ms - 80ms worst case when using static buffers and most of that time is NOT reflected in creating/deleting/filling the vertex buffers. Now if Xcode would display Ogre source code I could better nail down where in Ogre the overhead is coming from, but at least now I can see why the app is indeed smoother and less CPU intensive when using the shared buffer - though fps is lowered a bit.

Tradeoffs *sigh*(tm)

Hopefully my shared multiple vertex buffers idea will take care of both issues.