Monday, 17 February 2014

This Is Not A PR Blog Jim, It's a Dev Blog - Look Away


If you want good news, best not to read any more of this blog and wait for later in the week :) If you are brave and want to learn about the real world of software development, read on...

I integrated my new occluder into the Reloaded engine (you remember, the one that did amazingly well in my simple one occluder prototype). It only took a few minutes really, and the results were rather poor. The un-occluded scene I created from 50 buildings rendered flat out at 194 fps. This was my benchmark. I then added the code to submit all buildings as both occluders and occludees and ran the occlusion system on the exact same scene view. My frame rate dropped to 101 fps and my polygon count jumped by 90,000 polygons. Oh woot!

I did some investigative tests and it seems that when I removed the 'GetRenderTargetData and System Surface Lock' commands, the frame rate jumped back up to 186 fps. My conclusion was that the 'slight' GPU stall I was anticipating is in fact a huge stall when you are running at frame rates in the hundreds.

When I put it back in and reduced the depth buffer rendering to one draw call, it was slightly better than the worst score at 129 fps. This means the GPU lock is the biggest spender and the fifty depth scene render calls are the next cost to bear. As I could not avoid the big stall, I worked to combine all 50 building geometries into one large vertex buffer to do a single draw call, but this only yielded a marginally better frame rate of 105 fps.
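Frame rates in the hundreds are easier to compare as milliseconds per frame. The sketch below is only arithmetic on the numbers quoted above; the dictionary labels and function names are made up for illustration, and the per-stage costs should be read as rough indications, since GPU work and stalls do not add up neatly.

```python
# Frame rates quoted in the post, keyed by hypothetical labels.
FPS = {
    "no_occlusion": 194,
    "occlusion_full": 101,
    "occlusion_no_readback": 186,
    "depth_one_draw_call": 129,
    "merged_vertex_buffer": 105,
}

def frame_ms(fps):
    """Milliseconds spent on one frame at the given frame rate."""
    return 1000.0 / fps

def extra_ms(fps, baseline_fps):
    """Extra milliseconds per frame compared with the baseline run."""
    return frame_ms(fps) - frame_ms(baseline_fps)

if __name__ == "__main__":
    base = FPS["no_occlusion"]
    for label, fps in FPS.items():
        print(f"{label:24s} {fps:4d} fps  {frame_ms(fps):5.2f} ms  "
              f"(+{extra_ms(fps, base):.2f} ms)")
```

Seen this way, the full occlusion pass adds roughly 4.7 ms to a frame that otherwise takes about 5.2 ms, while the same pass without the readback adds only about 0.2 ms.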

The good news is that this overhead will not get any bigger as the scene grows in size (with a few optimizations I have in mind). The cost is in the depth scene render and the GPU stall, and those won't get any bigger, which means I can throw thousands of objects at it and the cost will be the same (or near as dammit) as ten objects. That said, I cannot rely on that assumption until I have field tested this new occlusion system with some other machines (and other users).

I then created a second level which had 25 barrels hiding behind a tent, and the system correctly occluded all the barrels when I stood behind it, and the draw call count dropped accordingly.

The good news ended pretty soon though, as the barrels exhibited visual popping because they were trying to occlude themselves (not a good idea to have an object that is both occluder and occludee), and the last thing I wanted to see was popping (avoiding it was the whole reason this new occlusion system was created).

More Work To Come

If you read this far, I will now treat you to the very good news. I half anticipated all the issues above, and despite the laundry list of woes I am quite pleased with how the HZB is able to work out occlusion and distribute that through the engine. My next plan is to create a 'preferred occluder' system which only selects near, 'large' objects for the occluders, which will stop that annoying popping and speed up the depth render stage. I can also speed up the management of the returned results by allocating some fixed memory instead of creating and deleting the memory allocation every cycle. And then there is perhaps my MOST AMBITIOUS plan of all: to eliminate the GPU stall.
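A 'preferred occluder' pass could look something like the sketch below. This is not the Reloaded implementation; `SceneObject`, `occluder_score`, `pick_occluders` and both thresholds are hypothetical, and "near and large" is approximated here as bounding radius divided by distance to the camera.

```python
import math
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    position: tuple  # world-space centre (x, y, z)
    radius: float    # bounding-sphere radius

def occluder_score(obj, camera_pos):
    """Rough on-screen size: bounding radius divided by distance to the camera."""
    distance = math.dist(obj.position, camera_pos)
    return obj.radius / max(distance, 1e-6)

def pick_occluders(objects, camera_pos, max_occluders=8, min_score=0.05):
    """Split objects into (occluders, occludees) so that no object is both."""
    ranked = sorted(objects, key=lambda o: occluder_score(o, camera_pos),
                    reverse=True)
    # Keep only the best-scoring few, and only if they are big enough on screen.
    occluders = [o for o in ranked[:max_occluders]
                 if occluder_score(o, camera_pos) >= min_score]
    chosen = {id(o) for o in occluders}
    occludees = [o for o in objects if id(o) not in chosen]
    return occluders, occludees
```

With a tent of radius 5 placed 10 units away and barrels of radius 0.5 placed 15 units away, only the tent clears the assumed 0.05 threshold, so the barrels are tested against the depth buffer but never drawn into it, which is what should stop them occluding themselves.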

I've hunted around and could not find any clever white paper which solves this issue; they all prefer to put you onto DX10 and DX11 to solve it with the much friendlier stream-out operation. It seems DX9 coders are left to fend for themselves with this problem, and my idea is utterly radical and perhaps just as slow as the current GPU lock. You are welcome to stop me if you think the idea is mad...

Instead of doing a 'GetRenderTargetData' command to get the visibility results back into CPU memory so I can switch objects on and off, I instead redirect the occlusion visibility texture (which contains little 1's and 0's for each object being represented in the scene) and pass it to my entity shader as a new texture. I then use what is called a vertex texture fetch to grab the visibility state from the texture produced by the occlusion system. If the value is 'not visible', I simply adjust the vertex position to 'behind the camera', which will force the whole object to skip sending its polygons to the fragment shader. The draw call for the object would still be made, but the shader would quickly reject all its polygons and move on. I am not sure if a draw call that renders no polygons is a freebie or still a performance drain, but that plus the vertex texture fetch are the two problem areas I anticipate. If my fears are unjustified and the performance hit is negligible, I will have created an entirely GPU-only occlusion method in DX9. The reason I am confident is that the current method produces a GPU stall that effectively halves my frame rate, and the benefit of a non-stalling occlusion pipeline will be apparent the moment I finish coding it and run a test (fingers crossed).
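The logic of that idea can be emulated on the CPU to check it hangs together. The sketch below is a Python stand-in, not shader code: in a real DX9 vertex shader the lookup would presumably be a `tex2Dlod` call under shader model 3.0, and every name here (`fetch_visibility`, `vertex_shader`, `triangle_rejected`, `draw_object`, `BEHIND_CAMERA`) is hypothetical. It assumes one texel per object and Direct3D clip space, where a vertex with z < 0 fails the near-plane test.

```python
# Clip-space point with z < 0, so it fails Direct3D's near-plane test (0 <= z <= w).
BEHIND_CAMERA = (0.0, 0.0, -1.0, 1.0)

def fetch_visibility(visibility_texture, object_index):
    """Stand-in for a vertex texture fetch: one texel per object, 1 = visible."""
    return visibility_texture[object_index]

def vertex_shader(clip_pos, object_index, visibility_texture):
    """Pass the vertex through if its object is visible, else push it behind the camera."""
    if fetch_visibility(visibility_texture, object_index) >= 0.5:
        return clip_pos
    return BEHIND_CAMERA

def triangle_rejected(v0, v1, v2):
    """True when all three vertices fail the near-plane test (z < 0)."""
    return all(v[2] < 0.0 for v in (v0, v1, v2))

def draw_object(triangles, object_index, visibility_texture):
    """Run each triangle through the emulated shader; count those that survive clipping."""
    survivors = 0
    for tri in triangles:
        shaded = [vertex_shader(v, object_index, visibility_texture) for v in tri]
        if not triangle_rejected(*shaded):
            survivors += 1
    return survivors
```

Note that the draw call is still issued for a hidden object and every vertex still runs through the shader; only the rasterisation and fragment work is skipped, which matches the two costs flagged above as unknowns.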

It's another 3:30AM, and no sight of my normal 9-5 day so far, but if we can crack this occlusion question and have it perform splendidly in the main Reloaded engine, I can draw a clean line under it and move on with confidence. No sense moving on until then (unless it starts to gobble up weeks!). As a fallback plan, I emailed a middleware company that provides one of the advanced occlusion methods used by Unity (apparently) to ask them for the price for including their tech. No reply yet, but from experience the answer is usually either 'you cannot use it in a game maker' or a number with many zeros on the end.

Still, the occlusion system seems to be holding up very well, and if the frame rate never drops below 80fps, even with the current stalling system, it is still a benefit over an engine that would otherwise slow down as you start hiding objects around your scene.  Plenty more occlusion news to come, watch this space!


  1. I died a little inside at the reminder that the engine is still using DirectX 9. I still think it's a bad idea, and you've already run into one issue because of it.

  2. Lee, check out this link. It seems he got HZB to work with DX9.

  3. My current one also works, but the stall hits the FPS. The guy also reports the same stall in his article "- Less stalling (just once for fetching the results from the GPU (dx9))". You cannot get away from this stall if you want to transfer visibility states to CPU memory. My Entirely-GPU approach might solve this though, watch this space!

  4. well lee m8, good luck hope it works out well :)

  5. Believe me when I say I only have extremely limited knowledge of how graphics APIs work internally, but from my tiny experience, I would suggest that it's likely to be much faster to render when fully on the GPU. Again, I really have no idea, but that's my instinct.