Oleg Andreev



Software designer with a focus on user experience and security.

You may start with my selection of articles on Bitcoin.

Translations of some articles into Russian.



Product architect at Chain.

Author of Gitbox version control app.

Author of CoreBitcoin, a Bitcoin toolkit for Objective-C.

Author of BTCRuby, a Bitcoin toolkit for Ruby.

Former lead dev of FunGolf GPS, the best golfer's personal assistant.



I am happy to give you an interview or provide you with a consultation.
I am very interested in innovative ways to secure property and personal interactions: all the way from cryptography to user interfaces. I am not interested in trading, mining or building exchanges.

This blog enlightens people thanks to your generous donations: 1TipsuQ7CSqfQsjA9KU5jarSB1AnrVLLo

Pitfalls of lack of encapsulation

Tony Albrecht, Technical Consultant at Sony, tells a story about memory performance and object-oriented programming style.

If you read the story carefully, you will notice that the performance problem was actually solved by writing class-specific allocators (which keep all objects of the same kind in a contiguous array) and by performing the recursive algorithm in two passes.

Tony knows very well what happens at the hardware level, but he is not good at object-oriented programming. Let’s see what is wrong with his code.

In the beginning, their code was not well organized: instead of multiple distinct objects with small, independent data structures, they had fat nodes with matrices, vectors and other structures built in. Because of that, the various kinds of data were interleaved and didn’t play well with the cache.

First optimization: they made the code object-oriented by factoring big data structures like Matrix and Vector out of the Node. Then they used encapsulation (a fundamental principle of OOP) to provide custom allocators with contiguous memory zones. This is only possible in properly object-oriented code, where objects of one kind do not interfere with objects of another kind except through explicit messages. So you can optimize the memory layout for Matrices, and provided their behavior is not changed, you will not break some other part of the code. OOP helped gain 35% in performance.
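For illustration, here is a minimal sketch of such a class-specific allocator (the names, the arena size and the bump-allocation scheme are my assumptions, not Tony’s actual code). The class overrides operator new so that all Matrix instances live in one contiguous arena, while every caller keeps writing plain new Matrix:

    #include <cassert>
    #include <cstddef>

    class Matrix
    {
    public:
        // Bump-allocate every Matrix out of one contiguous arena, so that
        // iterating over many matrices touches consecutive cache lines.
        static void* operator new(std::size_t size)
        {
            assert(used + size <= sizeof(arena));
            void* p = arena + used;
            used += size;
            return p;
        }

        // Individual frees are no-ops; the arena is reclaimed wholesale.
        static void operator delete(void*, std::size_t) {}

        float m[16];

    private:
        alignas(16) static char arena[1 << 20];
        static std::size_t used;
    };

    alignas(16) char Matrix::arena[1 << 20];
    std::size_t Matrix::used = 0;

The point is exactly that this optimization is invisible to the rest of the codebase: only Matrix knows where its instances live.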

Second optimization: they split the recursive update procedure into two phases to avoid walking bottom-up and adding the WBS (world bounding sphere) to each parent node more than once. This alone would save about 25% of CPU time (assuming a binary tree and no overhead on leaves). But they actually got about a 50-60% improvement, because they used a contiguous allocator for nodes just as they did for matrices and vectors.
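To make the two phases concrete, here is a sketch of how such a split could look (my reconstruction, not the code from the slides; Node, Matrix, Transform and Expand are hypothetical names):

    // Phase 1, top-down: refresh every node's world matrix from its parent.
    void updateTransforms(Node* node, const Matrix& parentWorld)
    {
        node->world = parentWorld * node->local;
        for (int i = 0; i < node->childCount; i++)
            updateTransforms(node->children[i], node->world);
    }

    // Phase 2, bottom-up: each parent absorbs its children's world bounding
    // spheres exactly once, instead of once per upward propagation.
    void updateBounds(Node* node)
    {
        node->wbs = node->bs.Transform(node->world);
        for (int i = 0; i < node->childCount; i++)
        {
            updateBounds(node->children[i]);
            node->wbs.Expand(node->children[i]->wbs);
        }
    }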

This is all understandable. But there are two design decisions which are not justified:

1. In the first optimization, Tony claimed that “excessive encapsulation is BAD” (slide 20) and thus decided to pull the raw pointers to the arrays of matrices and vectors out of their respective nodes and into the loop that iterates over the nodes (slide 89):

for (int k = 0; k < innerSize; k++, wmat++, mat++, bs++, wbs++)
{
    *wmat = (*parentTransform) * (*mat);
    *wbs = bs->Transform(wmat);
}

Do you see those wmat, mat, bs and wbs pointers? These are private members pulled out of the node objects under the claim that “excessive encapsulation is BAD”. Now the object no longer controls its own data, and once you’d like to add another special-effects matrix to the node, you will have to learn not only the Node class, but the entire rendering codebase!

Here is how it should actually be done:

for (int k = 0; k < innerSize; k++)
{
    children[k]->updateWithParentTransform(*parentTransform);
}

Here updateWithParentTransform does the job involving wmat, mat, wbs and bs, and guarantees that this is the single place where those variables are accessed directly.
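The body of such a method could look roughly like this (a sketch: the field names follow the slide, the method name is mine):

    // The only place in the codebase that touches mat, wmat, bs and wbs.
    void Node::updateWithParentTransform(const Matrix& parentTransform)
    {
        *wmat = parentTransform * (*mat);  // world matrix = parent * local
        *wbs  = bs->Transform(wmat);       // world bounding sphere
    }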

Also note that this method will be inlined perfectly well by a C++ compiler or by a smart dynamic Smalltalk/Self/JVM runtime, so the resulting code will perform the same operations and memory accesses as the manually inlined code with “naked” private pointers.

2. The second claim is to “Make the processing global rather than local” (slide 73). This is also awfully wrong. Tony suggests splitting the tree of nodes into arrays of nodes sorted by level. That is not only inflexible (or requires quite complicated bookkeeping to maintain the invariant), it is also pointless.

We already have the class-specific contiguous allocators that put nodes close to each other. We have already extracted the huge data structures from the nodes, so we can keep a lot of nodes in just a fraction of the L2 cache while the rest of it is used for matrix operations. And we have already split up the algorithm so that the parent’s boundary is not updated too often. Yet he still claims a performance gain from the fact that nodes are traversed not recursively but linearly, using quite a brittle memory layout.

There is no point in that, since node objects are so small that most of the data you need to update children with the parent’s transformation matrix is already in the cache. And for cached data it makes no difference how it is positioned: the access time is constant.

Not only did he trade nothing for more complicated code, he also made it harder to move from a single CPU to multiple CPUs (say, a GPU): it is precisely recursive algorithms and encapsulation that give you the option to parallelize computation. By flattening the algorithms and breaking encapsulation, Tony cut off his own path to scaling performance horizontally (or, equivalently, made it harder for an auto-parallelizing compiler to do its job).
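To illustrate, here is a sketch of how the encapsulated recursive version parallelizes almost for free (my addition: it uses C++11 std::async purely for brevity, and worldMatrix() is a hypothetical accessor for the node’s updated world matrix):

    #include <future>
    #include <vector>

    void updateSubtree(Node* node, const Matrix& parentTransform)
    {
        node->updateWithParentTransform(parentTransform);

        // Sibling subtrees own disjoint encapsulated state, so they can be
        // updated concurrently without locks. A real engine would use a
        // task pool with a depth cutoff rather than one thread per child.
        std::vector<std::future<void>> tasks;
        for (int i = 0; i < node->childCount; i++)
        {
            Node* child = node->children[i];
            tasks.push_back(std::async(std::launch::async, [child, node] {
                updateSubtree(child, node->worldMatrix());
            }));
        }
        for (size_t i = 0; i < tasks.size(); i++)
            tasks[i].wait();
    }

Once the tree is flattened into per-level arrays, this decomposition into independent subtrees is no longer visible in the code.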

It is very important to know how things work at the low level, but it is also important to know how to encapsulate low-level complexity and free your mind for greater things.

Update: Tony replied showing that I’m not entirely right. (March 18, 2010)