Book Notes: Data Oriented Design
Here are my rough notes from reading "Data Oriented Design" by Richard Fabian, which explains how to write code that runs quickly, uses less memory, allows higher scaling, and enables more code reuse by exploiting data-oriented techniques from the games industry.
The book is freely available online; I read the dead-tree edition.
How did I get in the position where I'd be interested in reading this book?
- A few years ago I watched Mike Acton's CppCon talk about how memory bandwidth is the limiting factor for code, but we are writing code assuming memory is cheap, and squandering 90%+ of the CPU capacity.
- I heard Andrew Kelly (lead Zig programming language developer) used this book to redesign Zig compiler data structures to be data-oriented. Andrew sums up the technique as 'use less memory, go fast'.
- At work, I'm optimising a large android app performance at work, so naturally I'm interested in how to write performant code on modern CPUs.
High-level criticism:
- I found this self-published book overall an interesting and sometimes mind-blowing read.
- I would only recommend the book if you really need to eke out the last drops of performance from your system.
- Skip Chapter 1, which is hard to keep track of: a bit too high-level, and somewhat ranty. Go straight to Chapter 2.
- The book would have benefited from stronger editing. I don't see an editor credited.
- Hopefully soon a better-edited book, or another edition will come out, which should be an easier recommendation.
- The book advertises itself as 'Software engineering for limited resources and short schedules' -- I don't think it really has much at all to do with limited schedules.
- All the same I applaud, respect, and thank the author for publishing this rough-ish draft: as they admit in the introduction: "there needs to be a book on this subject, and even though I'm not the right person to write it, it's probably better it exists, so someone can criticise its content and write a better one". If you go into this book, understanding that, and understanding that it's probably the best resource on this topic that exists so far, you shouldn't be disappointed.
Here follow my notes:
Store your in-memory data like a Relational Database would:
- This refers to the classic 'Out of the Tar Pit' paper, and applies its ideas.
- Store your data in memory as you would in a relational database
- Give your data a primary key -- can just be an auto-incrementing integer, or a UUID, or a name string.
- Store your data column-oriented. Both for speed and for conceptual integrity. Ideally store your data in an array keyed by the id integer.
- The 'entity' (e.g. player/enemy/ship/gun/pickup/inventory) is now just a key -- the contents of the entity are a decentralised collection of rows in different tables. Could have a position table, velocity table, enemy-AI table, mesh table, animation table, health table. The tables are just tables, no logic.
- Generally, this looks like transposing your class into struct-of-arrays form, to emphasise the collective positions over the instance's fields.
- This makes it easy to design systems that operate over multiple tables (e.g. physics: position column += velocity column), rendering (loop over all meshes and animation columns)
- Normalise your data using the database's normal forms
- Steal ideas from relational databases: sets, indices, joins, join ordering, query optimisers.
- There's a fantastic example in Chapter 2.3 taking an object-graph representation of a roguelike game (with meshes, textures, animations, pickups, rooms, doors) and slowly refactoring it into a column-oriented relational in-memory database.
- Avoid columns that can be NULL -- break them out into their own tables where the presence of a key indicates non-null.
- e.g. players make sounds, but boxes don't -- instead of having a nullable sound column in your entity table, have a separate 'sound' table keyed by entity ID, which only has rows for entities that make sound.
- This makes your algorithm run over the sound column fast, without branching, and saves memory capacity and bandwidth.
- Avoid boolean columns, in the same way:
- Many booleans indicate set membership, and are used in algorithms that branch heavily, breaking SIMD and branch prediction. Avoid this.
- Instead of storing the boolean (boolean Entity.is_alive), store the set of entity IDs in a table of 'alive entities'
- Then for all the logic to do with live entities, you can scan the alive table, and do no work at all for the dead entities.
- Enums can be normalised similarly to nulls and bools above:
- Enums often indicate set membership, so model it as entity ID membership in a column.
- This can also open up the possibility of an entity being in two enum states at once, which can lead to creative emergent game ideas.
- Fields that almost always have the same data can be implicit by absence
- e.g. most Entities have 100% health. Have a 'damage' table keyed by entity ID -- if the key is not in the table, the health is 100%. This can save a lot of memory and processing vs having every entity have a health field.
- This is compressing your data using domain knowledge
- You can go 'too far' in normalising your data. Same as in a relational database schema. See what works in practice for you.
- If your data is in relational format, you can write your logic as stateless transforms of this data. This is great for testing, debugging, understandability, and parallelism.
Once your data is in relational format, you can do less work, by only processing entities that are present in certain tables:
- e.g. In the physics system, you can stream the Velocity table, only updating objects that have velocities, skipping every other object with no velocity.
- e.g. in the rendering system, you can stream the Mesh and Texture table, only updating objects that are renderable, skipping all invisible objects
- The author calls this 'existential processing' -- only processing entities that need to be processed, and not even loading the memory for elements that don't have to be processed.
Consider compressing your data using "hierarchical levels of detail":
- Games have a concept of "Level of Detail" or LOD -- e.g. they load more compressed textures for things that are farther away, because you can't see their fine detail, to save memory and processing.
- The author argues we could use LOD a lot more -- for all sorts of entities. Gives the example of a anti-aircraft gun game where waves of squadrons of fighters come at you
- When the attackers are far away, you an model the data as a wave, with counters rather than meshes
- When the wave get closer, you can pop the wave and push a squadron -- with counters of planes and aggregate levels of damage
- When the squadron gets closer, you can pop the squadron and push individual planes with individual health, using more memory and more detail
- If a plane gets further away, it could be re-aggregated back into a squadron, saving/compressing memory.
- All this is fairly hard to do if you insist on modelling the data as individual planes at all times, and it's not straightforward to do in OOP code, but if you have tables of waves, squadrons, planes, it's a fairly easy thing to do: delete some rows in one table and add some rows to another table.
Be aware of the real hardware you run on:
- Memory bandwidth:
- Main memory loads are very slow compared to register and cache loads
- Memory bandwidth is a common constraint
- Algorithms courses generally teach you code that runs on a hypothetical machine where memory access is constant-time and cheap compared to instructions, and is thus incorrect.
- Pipelining:
- Modern CPUs are deeply pipelined -- they will stall if they can't guess the next instructions
- Virtual function calls (C++ vtables) generally cause memory loads and stalls (forcing them to load first the vtable and then the function the vtable points to (2x main memory loads)
- Loops where the condition is usually the same is good for the branch predictor
- Random work is very bad, causing lots of CPU stalls
- Cache evictions:
- Doing lots of unrelated work (e.g. if you recurse down a tree of different-classed virtual game objects calling 'update') will blow out your instruction and data cache.
- Prefer grouping data together in columns/arrays so that the same computations can run all at once on similar data.
- This requires the column-oriented store
- There are other apps running on the caches too -- you could be evicted at any time -- so try to keep your data small and your code will be faster.
- Cache lines:
- Usually loading 1 byte loads a 64 byte "cache line". You can't just read 1 byte of memory, the CPU brings in 64 bytes at a time.
- Reading these extra 63 bytes is generally almost free -- and you can build algorithms around this -- e.g. use B+Trees or B*Trees that branch many ways, and fit their branches in the cache line, rather than binary trees that only branch two ways and require more cache line loads.
- Vectorisation:
- SIMD instructions are fast
- Compilers can usually generate them… but only if your data is contiguous and aligned
- You can help compilers generate these instructions by doing the same work over similar data in a tight loop.
- Generic update() methods that change many fields of an object with different operations make it very hard to generate SIMD instructions.
Algorithms:
- Searching:
- Ideally you can just index into an array
- Make indices if you have to, just like a database would. Sort and binary search, or use a hash table.
- The author mentions that stuffing an entire skip list index into a cache line can outperform binary search.
- If you're searching for the max-value, pre-index them, or pre-index the top N values.
- B-trees have better cache performance than red-black or AVL trees
- Hashtables with chained collision resolution can benefit from storing many colliding key/value pairs together in the same cache line, so that a single hash lookup and cache line load is sufficient to find most key/value pairs.
- Sorting:
- Avoid sorting where possible. Sometimes you don't need to, and you can pre-bucket your data in the order you want.
- Don't sort just to get the top N items, there are cheaper algorithms for that (e.g. heaps)
- If some sort fields are discrete (e.g. if you always sort meshes before textures in your GL buffer upload), separate your data into discrete buckets (meshes, textures) before sorting, and you will save work.
- Sorting networks are a cool trick I hadn't heard of before to sort in predictable time.
Thread Safety:
- Avoid mutexes as much as possible because they serialise your code. most of the time you don't need mutexes if you can reorganise your data
- Prefer inherently concurrent operations -- mapping, filtering -- stateless transforms. Even associative reductions can be concurrent.
- Reader-writer locks of columns are better. Usually there are more readers than writers.
- Transforms of data can be copy-on-write -- you can use techniques like double-buffering the output of your algorithms to avoid too much allocation. Double buffering is usually used for pixel output but it works fine for any transform algorithm.
Criticisms of OOP:
- Huge inheritance hierarchies are bad
- Huge inheritance hierarchies don't let us add new functionality easily
- Base classes tend to accumulate lots of cruft
- C++ virtual tables break caches and branch predictors, forcing them to load the vtable and the function the vtable points to (2x main memory jumps). I think Java would have the same problem.
- Memory storage is row-oriented meaning if you want to (say) update the position of all instances, you are loading the position and velocity of the instance along with many other unrelated fields -- wasting your memory bandwidth
My discussion and further questions:
- The idea of transposing your data from row-oriented (instance-oriented) to column-oriented breaks my brain a little. I was glad I'd read Out of the Tar Pit and been primed by Mike Acton's CppCon talk, otherwise I might have rejected this outright -- after doing OOP at work for 10+ years it's a hard habit to reconsider.
- I'm very interested to try some of these ideas, though I dread the idea of refactoring instance-oriented code (particularly when it has instance-oriented unit tests that will all need updating). I think perhaps a new feature would be a more easy place to make this shift, or a feature that only has high-level integration tests rather than per-class unit tests.
- How can I apply this in Android/Java, where we don't have structs and arrays of structs? I did a Twitter thread while researching this. Generally I don't know of many patterns for making a relational-style in-memory database in Java -- this area seems immature.
- How to struct-of-array?
- float[], byte[], int[] would work well but you'd have to manually expand them
- fastutil FloatArrayList, ByteArrayList, IntArrayList should work fairly well, with expansion already built-in
- Maybe Fastutil Int2IntOpenHashMap if I need to delete often? The open-addressed hashmap causes much less memory pressure than chained resolution in java.util.HashMap, which stores collisions as tree nodes, sprayed all over the heap.
- java nio ByteBuffers have methods for packing data into raw arrays
- java.util Collections (e.g. ArrayList/HashMap) would generally work poorly; they all auto-box primitives, causing hearty memory loads
- Remember Clojure's B+tree-backed immutable vectors might be a good fit. And you can use them from normal Java code. Perhaps Scala's stdlib has some similar collections too?
- Perhaps we could code-generate this from some higher-level schema definition?
- It sure would be nice if there was some higher-level schema to generate zero-copy contiguous column-oriented no-boxing data structures
- This blogpost is a really great example of a columnar store in Java with streams and iterators and very little allocation, using a mutable cursor over primitive arrays.
- How to join columns? This seems like a lot of complexity that's been kinda swept aside
- When joining, you have a choice of which order to join in. e.g. joining position with velocity tables, you could loop through position and then look up equivalent indices in velocity, or vice versa. Usually starting from the smaller table is faster.
- With 3 tables (A,B,C), there are 6 orderings you could join: ABC, ACB, BAC, BCA, CAB, CBA.
- Choosing an order to join is a combinatorial problem and SQL engines have very smart optimisers to choose a path through depending on table sizes, available indices, and fanout.
- I wish there was something similar I could use for my in-memory database!
- The examples all just say to loop the smallest table first, then the next smallest table, and so on until you get to the biggest table, and hopefully you don't have to read many rows of the biggest table.
- How to do all this without allocating in Java?
- You'd probably want to write Java code that looks a lot like C. No generics, to avoid boxing primitives. Raw callbacks, rather than generic callbacks.
- Java Streams of a mutable cursor with getters for each column that delegate into the underlying column storage would probably work well. (example)
- The open-source Artemis-ODB and Ashley Entity-Component-System frameworks both allow pooling of objects -- this is better for performance than allocating tons more data, but it's much worse than a struct-of-arrays, because the object headers are expensive, the data is spread over the heap, and the garbage collector has to traverse the references.
- Or should I just call JNI for a lot of this? What would the overhead be?
- Python notably has a great column-oriented dataframe library: Pandas, with SIMD operation optimisations, called as native code. I wonder if that is a model we could use also in Java.
- Some machine learning libraries have dataframe libraries (e.g. Tensorflow) -- I wonder if they could work as the datastore.
- Perhaps data-oriented programming just is too much against the grain of the language right now? Maybe I have to wait for Project Valhalla to give us inline structure types? I hope not; that's not close to landing.
- How to struct-of-array?
- Are there fast in-memory database engines that allow me some of the cool database engine features (fast joins, foreign keys, consistency, schema definition) at the same time as extreme raw speed?
- Maybe: non-SQL-string-parsing API -- or at least, precompiled SQL query?
- Zero-copy: Expose the data as a cursor over a contiguous chunk of memory -- rather than doing lots of copies?
- What would a data-oriented programming language look like? First class array-of-struct to struct-of-array transposing, probably? Zig and Rust might be able to do this.
Comments ()