Damn Cg.. and shaders in general ?

Davide's picture

Today I hit a wall.. a performance issue that I was expecting at some point, but not one quite this bad.

Currently I'm working with nVidia's Cg for shaders on OpenGL. One ugly thing about shaders is that one often ends up with a lot of permutations, depending on which inputs a shader deals with.
For example, one shader may take vertex color as input while another takes texture coordinates.. and since every optional input doubles the number of variants, it grows exponentially !

Each combination has traditionally been converted into a separate shader program.. which is not nice, especially if there is no simple way for all these programs to share the values that should be common to all of them.
For example, any vertex shader will normally get one (or two) transformation matrices per object, while an object is often built of different materials and therefore uses different shaders.

In Direct3D 9 with HLSL one can set some virtual global registers, which is a bit ugly (there is a higher-level system on PC but not on Xbox 360) but does a nice job of sharing those values across programs. In Cg, however, one has to explicitly connect shader parameters.. which, as it turns out, can be pretty bad for performance.

My current shader base is composed of 16 shaders in total, which all share 2 transformation matrices (to screen space and to world space). So, every time I set those two matrices for a 3D object, it really sets 32 matrices and it possibly even stalls somewhere.. because my frame rate drops drastically for 1000 objects !!!
Basically I'm not even rendering anything, and I see the frame rate go from 90 to 15 FPS just by setting up the transformation matrices for 1000 objects.
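
To put a number on that, here is a rough back-of-the-envelope sketch in C++. It is not Cg code: the function name and its parameters are made up for illustration, and it only counts matrix uploads under the assumption that every connected program receives every shared matrix each time an object is set up.

```cpp
#include <cstdint>

// Hypothetical helper: counts how many matrix uploads happen per frame when
// every shared matrix is propagated to every connected shader program.
constexpr std::uint64_t matrixUploadsPerFrame(std::uint64_t objects,
                                              std::uint64_t programs,
                                              std::uint64_t sharedMatrices) {
    return objects * programs * sharedMatrices;
}

// Numbers from the post: 16 programs sharing 2 matrices.
static_assert(matrixUploadsPerFrame(1, 16, 2) == 32,
              "32 matrix sets per object");
static_assert(matrixUploadsPerFrame(1000, 16, 2) == 32000,
              "32000 matrix sets per frame for 1000 objects");
```

If each object only needed the one program it is actually drawn with, the same 1000 objects would cost 1000 x 1 x 2 = 2000 matrix sets, so 15 out of every 16 uploads are wasted in this model.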

Supposedly Cg 2.0 (if combined with the latest nVidia cards ?) allows a common buffer that all shaders can share.
I'm curious to see how Direct3D 10 deals with those shader params and state changes in general... however D3D 10 requires Vista, an OS that isn't quite approved yet in my company.
Given the situation, I will be getting a new computer to be used to run Vista. As I picked the hardware I felt a bit guilty asking for relatively high end stuff.. but it's for work.. really !

wooooooo

P.S. Let's see if I can go to sleep before 4 AM tonight !

You might want to consider

You might want to consider streaming matrix data into the shader instead of passing it through constants. I think that's the standard performance optimization for instancing. There are some papers available on that.

In this case I'm assuming no

Davide's picture

In this case I'm assuming no instances.
I'm not sure how I'd stream matrix data.. but in any case I think the first 2 simple options are:

- Make a long shader with dynamic branching (but that's limiting for plain custom shaders)

- Avoid using cgConnectParameter() and handle shared parameters directly to minimize the number of sets: it's safe to assume that an object will not touch all the shaders, therefore there is no need to automatically scatter the matrix param to all the shaders
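
The second option could look roughly like the C++ sketch below. Everything in it is hypothetical (Program, SharedMatrices, simulateFrame are names I made up) and it does not call Cg at all: the upload is just a counter standing in for whatever call actually sets the two matrices on a program. The idea is to keep the shared matrices on the CPU side with a version number, and to push them only to the program that is about to be used.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Matrix4 = std::array<float, 16>;

// Stand-in for a compiled shader program. uploadCount exists only so the
// effect can be measured; a real version would call into the shader API.
struct Program {
    unsigned seenVersion = 0;  // version of the shared matrices last uploaded
    int uploadCount = 0;
};

// Shared per-object matrices, kept on the CPU side instead of being
// connected to every program.
class SharedMatrices {
public:
    void set(const Matrix4& toScreen, const Matrix4& toWorld) {
        toScreen_ = toScreen;
        toWorld_ = toWorld;
        ++version_;  // marks every program's copy as stale, uploads nothing
    }

    // Call right before drawing with `p`: uploads only if `p` is out of date.
    void bind(Program& p) const {
        if (p.seenVersion == version_) return;
        p.uploadCount += 2;  // placeholder for the two real matrix uploads
        p.seenVersion = version_;
    }

private:
    Matrix4 toScreen_{};
    Matrix4 toWorld_{};
    unsigned version_ = 0;
};

// Simulates one frame in which each object sets its matrices and then draws
// with a single program. Returns the total number of matrix uploads.
inline int simulateFrame(std::size_t objects, std::vector<Program>& programs) {
    if (programs.empty()) return 0;
    SharedMatrices shared;
    for (std::size_t i = 0; i < objects; ++i) {
        shared.set(Matrix4{}, Matrix4{});
        shared.bind(programs[i % programs.size()]);
    }
    int total = 0;
    for (const Program& p : programs) total += p.uploadCount;
    return total;
}
```

In this model 1000 objects spread over 16 programs cost 2 uploads each, because only the program being drawn with is ever touched. An object made of several materials would call bind() once per program it uses, which is still far from scattering to all 16.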

Cg this Cg that...

Duddie's picture

Blah, all you people use Cg and then complain that it does not do its job. What the hell is wrong with you? It is not as if a shader can be 10 MB of code. Where is the spirit of doing things low-level like it was in Amiga times? What's wrong with using low-level shaders and writing your own lego-builder to reuse parts? Instead people use this crappy high-level language that is so artificial and distant from anything there and then complain Cg this Cg that...
Get busy MFs!!!

Tell that to Sony who

Tell that to Sony, who doesn't want to release a shader assembler, so everybody is basically forced to write shaders in Cg on the PS3.

Mhmhm :)

Duddie's picture

Why not reverse engineer it? :)
But Sony is not a good example here, as they do not intend to be portable or compatible with anything. I was talking about low-level and high-level shaders on OpenGL, where it will be both portable and compatible with any cards and APIs out there.
In the case of Sony I somehow understand them. Their HW is crap, and when they have a PS4 that works totally differently from what they have now, they want to make sure the SW will still work there (or at least be possible to emulate). Look at those PS1 games abusing the HW at a low level: they have big problems running under emulation on the PS2.

I think in case of the PS3

I think in the case of the PS3 it's more of a legal bullshit issue than anything else.

Nvidia doesn't want to release technical data for its hardware.

I think Sony would've been more than happy to release any information to attract developers to work with the PS3, because right now people are staying away from it for obvious reasons.

I'd gladly get rid of

Davide's picture

I'd gladly get rid of shaders as they are.. I've never been a big fan, but it's going to be code that has to be parallelized by either a compiler or a human ;)

Linearity

Duddie's picture

Most software is designed by humans in a linear way, and we really have no language that handles parallel things...
Thus it is much better for a human to just make things parallel at the lowest possible level.
I have always wondered about the purpose of those Cg compilers and so on.

Some things are somewhat

Davide's picture

Some things are somewhat easy to do in parallel.
One reason why the z-buffer is so popular is that one can write to it in random order, which makes it easy for multiple processors to cooperate.
However, if you write triangles at random, cache coherence is probably not so good (but actually it might not be so bad, considering that one writes one object at a time and that most triangles can easily be grouped together).
What really gets in the way of coding cleanly, however, is the vector kind of parallelism, where one has large registers that can handle more than one pixel and tries to group 2, 4, 16 pixels at a time... that can really mangle a loop, as it needs special cases for portions smaller than the vector size and then needs to align to memory based on that same vector size.
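
To show what I mean by a mangled loop, here is a minimal C++ sketch. The names (kWidth, scalePixels) and the vector width of 4 are my own assumptions, and it uses plain scalar code in groups of 4 rather than real vector instructions, so it only illustrates the shape of the loop: a scalar head to reach an aligned address, a main loop over whole vectors, and a scalar tail for the leftovers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWidth = 4;  // assumed vector width, in pixels

// Multiplies every pixel value by `gain`.
inline void scalePixels(float* pixels, std::size_t count, float gain) {
    std::size_t i = 0;

    // Head: one pixel at a time until the address is aligned to the
    // vector size (kWidth floats).
    while (i < count &&
           reinterpret_cast<std::uintptr_t>(pixels + i) %
                   (kWidth * sizeof(float)) != 0) {
        pixels[i] *= gain;
        ++i;
    }

    // Main loop: whole groups of kWidth pixels. The fixed-size inner loop
    // stands in for a single vector operation.
    for (; i + kWidth <= count; i += kWidth) {
        for (std::size_t lane = 0; lane < kWidth; ++lane) {
            pixels[i + lane] *= gain;
        }
    }

    // Tail: the leftover pixels that do not fill a whole vector.
    for (; i < count; ++i) {
        pixels[i] *= gain;
    }
}
```

What would be a one-line loop in the plain scalar version turns into three loops, and that is before the actual vector instructions are involved.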

Also add pipelining into

Also add pipelining into this and it becomes even more convoluted =)

technology of information

Davide's picture

In the end it's all about memory !
How it is organized, how one is supposed to access it, how big it is !
Recently at the IDF I learned about the Intel 80 cores project and how they are looking into building 3D grids for the future. That's for connecting processors but speaks for memory too.. whatever new technology comes up, everyone is still very realistically tied to caching.
Still, my first lesson about cache was that it's better not to try to be too smart.

I guess the ultimate

I guess the ultimate solution is to have all the CPU cores, the GPU, memory, network controller etc. on a single super-chip =)

That way you don't need any caching =)
