I could hardly believe how beautiful and wonderful the idea of LISP was [McCarthy 1960]. I say it this way because LISP had not only been around enough to get some honest barnacles, but worse, there were deep flaws in its logical foundations. By this, I mean that the pure language was supposed to be based on functions, but its most important components — such as lambda expressions, quotes, and conds — were not functions at all, and instead are called special forms.
Landin and others had been able to get quotes and conds in terms of lambda by tricks that were variously clever and useful, but the flaw remained in the jewel. In the practical language things were better. There were not just EXPRs (which evaluated their arguments), but FEXPRs (which did not). My next question was, why on earth call it a functional language? Why not just base everything on FEXPRs and force evaluation on the receiving side when needed?
I could never get a good answer, but the question was very helpful when it came time to invent Smalltalk, because this started a line of thought that said “take the hardest and most profound thing you need to do, make it great, and then build every easier thing out of it”. That was the promise of LISP and the lure of lambda — needed was a better “hardest and most profound” thing. Objects should be it.
— Alan Kay, The Early History of Smalltalk (1969)

Tony Albrecht, Technical Consultant at Sony, tells a story about memory performance and object-oriented programming style.
If you read the story carefully, you will notice that the performance problem was actually solved by writing class-specific allocators (which keep all objects of the same kind in a contiguous array) and by doing the recursive algorithm in two passes.
Tony knows very well what happens at the hardware level, but he is not good at object-oriented programming. Let’s see what is wrong with his code.
In the beginning, the code was not well organized: instead of multiple distinct objects with small, independent data structures, they had fat nodes with matrices, vectors and other structures built in. Because of that, the various kinds of data were interleaved and did not play well with the cache.
First optimization: they made the code object-oriented by factoring big data structures like Matrix and Vector out of the Node. Then they used encapsulation (a fundamental principle of OOP) to provide custom allocators with contiguous memory regions. This is only possible in properly object-oriented code, where objects of one kind do not interfere with objects of another kind other than through explicit messages. So you can optimize the memory layout for Matrices, and, provided their behavior is not changed, you will not break some other part of the code. OOP helped gain 35% in performance.
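To make that concrete, here is a minimal sketch of a class-specific allocator that keeps every Matrix in one contiguous arena. This is not Tony’s actual code; the arena, the bump allocation and the Matrix layout are my assumptions, just enough to show that callers never notice where the objects live.

    #include <cstddef>
    #include <vector>

    class Matrix
    {
    public:
        float m[16];

        // Callers keep writing "new Matrix"; where the object lives is a
        // private decision of the class, which is exactly what encapsulation buys.
        static void* operator new(std::size_t size);
        static void  operator delete(void* ptr) noexcept;
    };

    namespace
    {
        // One contiguous arena for all Matrix instances: a trivial bump
        // allocator, with no bounds checks or recycling, to keep the sketch short.
        std::vector<unsigned char> arena(1 << 20);
        std::size_t                arenaUsed = 0;
    }

    void* Matrix::operator new(std::size_t size)
    {
        void* p = arena.data() + arenaUsed;
        arenaUsed += size;
        return p;
    }

    void Matrix::operator delete(void*) noexcept
    {
        // A bump allocator releases everything at once (e.g. at the end of a frame).
    }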
Second optimization: they split the recursive update procedure into two phases, to avoid walking bottom-up and adding to a parent node’s WBS (world bounding sphere) more than once. This alone would save about 25% of CPU time (assuming a binary tree and no overhead on leaves). But they actually got about a 50-60% increase, because they used a contiguous allocator for nodes just as they did for matrices and vectors.
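For reference, here is one way such a two-phase split can look. This is only a sketch under assumed Matrix/Sphere stubs, not Tony’s actual code; the point is that world transforms are computed top-down, and each parent’s bounding sphere is gathered exactly once, after its children are done.

    #include <vector>

    struct Matrix { float m[16]; };
    inline Matrix operator*(const Matrix& a, const Matrix&) { return a; }   // stub

    struct Sphere
    {
        Sphere transformedBy(const Matrix&) const { return *this; }         // stub
        Sphere mergedWith(const Sphere& other) const { return other; }      // stub
    };

    struct Node
    {
        Matrix localTransform, worldTransform;
        Sphere localBS, worldBS;
        std::vector<Node*> children;

        // Phase 1: top-down, world transforms only.
        void updateTransforms(const Matrix& parentWorld)
        {
            worldTransform = parentWorld * localTransform;
            for (Node* c : children)
                c->updateTransforms(worldTransform);
        }

        // Phase 2: bottom-up, each parent gathers its children's spheres once,
        // instead of being re-expanded every time some descendant changes.
        void updateBoundingSpheres()
        {
            worldBS = localBS.transformedBy(worldTransform);
            for (Node* c : children)
            {
                c->updateBoundingSpheres();
                worldBS = worldBS.mergedWith(c->worldBS);
            }
        }
    };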
This is all understandable. But there are two design decisions which are not justified:
1. In the first optimization Tony claimed that “excessive encapsulation is BAD” (slide 20) and thus decided to pull raw pointers to the arrays of matrices and vectors out of their respective nodes and into the loop that iterates over the nodes (slide 89):
    for (int k = 0; k < innerSize; k++, wmat++, mat++, bs++, wbs++)
    {
        *wmat = (*parentTransform) * (*mat);
        *wbs = bs->Transform(wmat);
    }
Do you see those wmat, mat, bs, wbs pointers? These are private things pulled out of the node objects under the claim that “excessive encapsulation is BAD”. Now the object no longer controls its own data, and once you want to add, say, another special-effects matrix to a node, you will have to study not only the Node class, but the entire rendering codebase!
Here is how it should actually be done:
    for (int k = 0; k < innerSize; k++)
    {
        children[k]->updateWithParentTransform(*parentTransform);
    }
Here updateWithParentTransform does the job involving wmat, mat, wbs and bs, and guarantees that this is the single place where those variables are accessed directly.
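A sketch of what such a method could look like (the member names, the stubs and the BoundingSphere type are my assumptions, not Tony’s API): it does exactly the work of the loop body above, but the node’s private data never leaves the Node class.

    struct Matrix { float m[16]; };
    inline Matrix operator*(const Matrix& a, const Matrix&) { return a; }      // stub

    struct BoundingSphere
    {
        BoundingSphere Transform(const Matrix*) const { return *this; }        // stub, mirrors bs->Transform(wmat)
    };

    class Node
    {
    public:
        void updateWithParentTransform(const Matrix& parentTransform)
        {
            m_worldMatrix = parentTransform * m_localMatrix;    // was: *wmat = (*parentTransform)*(*mat);
            m_worldBS = m_localBS.Transform(&m_worldMatrix);    // was: *wbs = bs->Transform(wmat);
        }

    private:
        Matrix         m_localMatrix, m_worldMatrix;
        BoundingSphere m_localBS, m_worldBS;
    };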
Also note that this method will be perfectly inlined by a C++ compiler or by a smart dynamic Smalltalk/Self/JVM runtime, so the resulting code will perform the same operations and memory accesses as the manually inlined code with “naked” private pointers.
2. The second claim is to “Make the processing global rather than local” (slide 73). This is also awfully wrong. Tony suggests splitting the tree of nodes into arrays of nodes sorted by level. Not only is this inflexible (or requires fairly complicated machinery to maintain that invariant), it is also pointless.
We already have the class-specific contiguous allocators which put nodes close to each other. We have already extracted the huge data structures from the nodes, so we can keep a lot of nodes in just a fraction of the L2 cache while the rest of it is used for matrix operations. And we have already split up the algorithm so that the parent’s boundary is not updated too often. Still, he claims some performance gain from the fact that the nodes are traversed not recursively but linearly, using quite a brittle memory layout.
There is no point in that, since node objects are so small that most of the data you need to update the children using the parent’s transformation matrix is already in the cache. And for cached data it does not matter how it is laid out: the access time is constant.
Not only did he trade nothing for more complicated code, he also made it harder to move from a single CPU to multiple CPUs (or, say, a GPU): it is recursive algorithms and encapsulation that give you the option to parallelize the computation. By flattening the algorithm and breaking encapsulation, Tony cut off his own way to scale performance horizontally (or, equivalently, made it harder for an auto-parallelizing compiler to do its job).
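As a rough illustration of what recursion plus encapsulation keeps open, here is a sketch using the same hypothetical Node/Matrix names as above (not a drop-in for Tony’s engine): each child subtree owns its data, so subtrees can be updated concurrently without touching the code of any other class.

    #include <future>
    #include <vector>

    struct Matrix { float m[16]; };
    inline Matrix operator*(const Matrix& a, const Matrix&) { return a; }   // stub

    struct Node
    {
        Matrix localTransform, worldTransform;
        std::vector<Node*> children;

        void updateSubtree(const Matrix& parentTransform)
        {
            worldTransform = parentTransform * localTransform;

            // Each child subtree touches only its own encapsulated state,
            // so the recursion can fan out across cores. (A real engine would
            // use a task scheduler rather than one OS thread per child.)
            std::vector<std::future<void>> tasks;
            for (Node* child : children)
                tasks.push_back(std::async(std::launch::async,
                                           &Node::updateSubtree, child, worldTransform));
            for (auto& t : tasks)
                t.get();   // join before the parent's subtree counts as updated
        }
    };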
It is very important to know how things work at the low level, but it is also important to know how to encapsulate that low-level complexity and free your mind for bigger things.
Update (March 18, 2010): Tony replied, showing that I’m not entirely right.