Maximum IO & Minimum CPU with a lock-free embedded no-SQL Linux/C++ object data structure template library

** DRAFT **


Be them open source or proprietary, communicating over the network or embedded into our applications, even the ones able of handling Big Data... we are never comfortable with the performance of traditional databases.

Since we need a more performatic solution, we had to deconstruct the principles and motivations which started the database era. The main characteristics of modern databases may be summarized:

  • human beings should be able to do anything they might want with the data without having to write an additional program;
  • data is sensible and access control is always enabled -- except, maybe, for SQLite;
  • the ACID properties must be honored.

Databases with similar characteristics fist appeared almost immediately after programs started dealing with valuable data. From the boss's perspective, you'd better put it into a vault, with alarms, guards and all the backup stuff -- and pay the price for it with a smile. It happened in the 60's.

Deepening into the vault analogy, today some use vaults where they just needed wallets. Can we do better?

  • Is it really always necessary to provide means for human beings to fully and directly manipulate any kind of data? If not, maybe we don't need SQL at all;
  • Isn't there any applications which don't use sensitive data? or whose data is protected via other means? if there are, we don't need authentication nor access control mechanisms;
  • Can't some applications guarantee the Atomicity, the Consistency and the Isolation of the ACID properties?
  • Do all applications require such rigid Durability (from the ACID properties)? In case of data corruption, can't we start over for cases where the data may be reconstructed? Temporary tables on a database don't honor the Durability principle anyways.

Some might think it is impossible to build an important system like this, but, hey! I just described the properties of the RAM memory! Any data who will ever be stored, participated for a longer or shorter time on this "not to be trusted media".

Don't get me wrong: I'm not writing about a memory database; I'm writing about an on-disk memory mapped database, which, considering it is done for C++, implies it is an Object Database -- or an Object Oriented Database.

Object Databases tend to be faster, from the start, than Relational Databases because Object Databases don't have to resolve relationships: the structures points directly to one another -- and this saves a lot of costly index lookups or, even expensivier, full table scans for the folks who let their databases go misindexed.

I, hereby, present TransparentDB++: it cannot be called a OODBMS, but many applications will live a happy and efficient life with it anyway; humans, not so much: you'll have to delegate to your programs any data manipulation you may want to do with it -- just like it happens with your RAM memory -- and there is no default GUI or CLI, although one could possibly be written.

TransparentDB++ is a lock-free ultra-fast embedded no-SQL Linux/C++ object data structure template library designed to query or update at the hardware's max IO throughput costing just penny CPU cycles. It may even be faster: when querying, if the data is in the cache, there is no overhead regarding IO operations and it performs as fast as memory-only arrays, lists, hashmaps and so on, since it uses mmaps for it's backing store; on the other hand, if you are adding records and there is enough free RAM, it may behave just like setting the elements of an array.

The principles TransparentDB++ brings on in dealing with data are not suitable for applications where data is the most valuable asset, since the role of Database Administrators doesn't exist. Speed, indeed, is achieved by dropping all requirements traditional DBMSes have regarding their need to be operated by humans, like:

- SQL, stored procedures and the like -- you'll have to write a C++ routine instead;

- Atomicity -- most of this one comes for free on a well designed object database: all child objects are only visible when the root object, the last to be created, is added. For more complex cases, the program should use something like mutexes;

- Transactions -- we don't provide any version control: it is up to the program to unroll from any complications it has put itself in in case of exceptional errors;

- Query Planning -- such plans should be made by the developer, when choosing and testing which indexes works better for their data;

- Instrumentation -- We have those only in debug mode;

- Security Related stuff -- we hand over that to the filesystem: if an user has read permissions on a database file, he/she will be able to query it; the same for write permissions.

- Catastrophe prevention -- no partial writes may corrupt the database, even when performing in-place defragmentations -- and we don't flush on every operation. We assume the process might crash, but the OS never will.

Being designed to be operated by programs instead of by humans doesn't mean you'll lose your data if you lose your program. It means you'll have to write another program. The database files still carriy the schema of your data, if you want them to.

Being embedded doesn't mean the database can't be accessed simultaneously by different processes. We use mmap, so the database is a shared memory for any process running on the same machine. TransparentDB++ has meta information to protect against any version issues on its internal structures and provides it's own mutexes client processes must use to avoid concurrency problems. Mutexes impose a performance penalty, though. Your application should also use the meta information infrastructure to control it's own internal structures format among different application versions.

Note that a malfunctioning application probably will damage the data and might turn the database completely invalid. Although TransparentDB++ can't help you in any way regarding fixing data corruption issues, you are encouraged to use the filesystem features. If you use a versioned filesystem (like NILFS, ) or a copy-on-write able fs (btrfs, ext3cow), the recommendation is that you rotate your pointers every time a new version of each process able to write to the database is deployed. BTW, remember that everyone always knows someone who just forgot that the DELETE FROM <table> sql command should probably take a WHERE clause... so not even traditional DBMSes are immune to buggy applications. Do backup.

Since we use indexes, there may be a trade-off between insertion and query speeds. Costly indexes, such as sorted (or even hashtables who needs to expand their capacities) demand CPU juice. We try to mitigate a possible CPU starvation by putting the needed changes on a queue, which is executed respecting the user given limits. So your data is added directly to the structure, but you should call a function to inform the index manager which data was added and there are both synchronous and asynchronous versions: the asynchronous index treatment returns immediately, but a subsequent query probably won't return your data, since the indexing routines probably haven't had time to run yet; the synchronous-able, on the other hand, is able to wait until the indexes are fine and queries on the fresh data are ready to work as expected. About the "queries won't return your data" statement, please note that this is only valid for index and filter queries, since object references and record traversals don't use indexes at all.

For those who want really, really fast queries (and accept to pay the price), we have "filters". Like indexes, they also demand CPU processing upon data modifications but, unlike indexes, they are not meant to reference all the data. Cheap filters are just arrays whose elements do pass a specified criteria -- traversing through them are as fast as traversing through arrays -- the elements order are somewhat related to insertion order. A little less cheap filter will guarantee order on the results and will still be traverseable as an array. The expensive filters act like cheap filters and indexes combined: they may be inquired about the presence of a certain value very fast, which is useful if you'd like to combine filters -- (( maybe combining filters and index is a bad idea and it is better to build a new filter instead, acting as a combined filter, as this would be the only case where filters could be inspected for holding an element or not ))

When you create a DB, you should place in all structs and classes that make a close reference loop -- it is not possible to reference objects on another DB.

  • 1 to 1 Relatioship pointers should be typed as TransparentDBRef<YourType> yourObject;
  • Lists may be declared as TransparentDBList<YourType> yourListOfObjects;

Classes and Structs are only able to use known types. Known types are all C++ primitive types, all user defined types declared on the DB instance, the reference type above, arrays for the previous ones and the std::string / estl::string / cstring types. Be careful, for trying to change strings in place is an easy way to corrupt the data. Anyway, all strings are stored together on a separate file and it is recommended to create new instances if some of them needs to be changed.

Data is represented on the DB file as a linked lists: for every class and struct on the closed reference loop, there is an implicit field called '_transparentDBNext' which equals to nullptr on the last element. Adding a new element of any object type is just a meter of getting the location of the next free address, populate it and set the previous pointer to the new one. In case of complex object references, the reservation order should maximize the benefits of the read-ahead feature: when commanded to read a page from the disk, not only one, but a bunch of subsequent pages are also read. So, if you are writing a complex object reference, reserve first the root object, then the first sub reference (or list), then the second sub sub reference and so on, even if they will be populated in reverse order: information will be buffered and will be written linearly -- if you have strings, remember they will be created on another file, so their order relative to the objects don't matter; just their order in relation to one another do. Do your best to have them read linearly on access time.

To reserve subsequent space for new objects on the DB file, even if multiple threads are writing, call:

  • object = DB.new<ObjectType>();
  • object* = DB.newObjectType>(n);
  • auto[rootObj, subList1*, subSubList1_1*, subList2*, subSubList2_1*] = DB.new<RootType, SubType, SubSubType, SubType, SubSubType>(1, 5, 7, 3, 1);

The position 0 of the DB file is special: it is where the comments + meta information resides. The comment section goes up until a \0x1a (^Z / EOF) character is found and the rest of the meta information location is indicated in the comments. The comments are a text containing panic information that might be useful if a manual restoration of the data will ever be needed. It contains:

  • The location, on the meta information page, of every pointer and mutex;
  • The word size (32 or 64 bits);
  • The page size (4096 bytes);
  • The version tag for TransparentDB++ data structures -- if TransparentDB++ don't recognize this number, it will refuse to open the database;
  • The version tag for the user defined object types schemas;
  • The schema for every object represented by this database;
  • The strings file name for this database
  • The filter condition, filter name and the object type for each filter;
  • The indexing algorithm, object type and field to be indexed for each index;
  • The update log queue file names for all object types on this DB

On the next page, comes the meta information itself:

  • The EOF pointer, representing the offset in the file where new objects should be added;
  • The START_OF_DATA pointer, usually a page after this one and the lower value between all HEAD_ and DELETED_HEAD_ offset pointers;
  • The HEAD_ pointers for each root object type represented on the database;
  • The DELETED_HEAD_ pointers for each object type;
  • The LAST_FILTERED_ pointers for each root object type. Some heavily used filters might be updated as soon as the information is added, some others may take a lazy approach: they are updated only when querying takes place; some filtering and indexing may also be temporarily turned off -- when batch inserting, for instance. These pointers keep track where to continue;
  • The LAST_INDEXED_ pointers for each root object type. Whenever indexing should resume, these are consulted and updated;
  • All mutexes the user requested to be created.

Note that the DB file is a sparse file and the listed size will always be greater than the EOF pointer.

Deleted objects will be taken out of the TransparentDB++'s HEAD_ list and will be placed on the respective DELETED_HEAD_ list -- without any other modification to the object. A later process might regroup and reclaim the space in the future. Reusing the space for new objects is possible but will likely have the data scattered through several noncontinuous pages, hurting read performance.


Filtering and Indexing:

When inserting, deleting and updating, some race conditions may occur with other querying threads, because these operations are done in two stages: first, the raw data is modified, then the indexes and filters are updated. All sorts of false-positives and false-negatives may occur on queries at these brief moments -- if this is an issue with your application, you should consider using mutexes. For the insertion, queries just won't be aware of the new data; for deletions, the problem is the opposite and may get serious, since the data might be invalid; updates are the worse, because they might be tought of like a delete followed by an insert.

Our solution for the insertion problem is easy: all insertions are sequential, so they are handled by the meta information page pointers LAST_INDEXED_ and LAST_FILTERED_, which just need to follow the pointers on every object. A similar approach is done when deleting, since we maintain a DELETED_HEAD_ list for future space reutilization; Now, for updates, we do not have any lists, so they must be constructed specially to deal with the indexes and filters: these are log queue lists, one for each object type. Like insertions and deletions list, they will keep a log of what has been done, until they are eventually cleaned up.

Filters are just lists of pointers to objects which passes a certain criteria. It needs to be fast, so the list is implemented as an array (a linked list would be a 100% overhead). For this, they need their own file, allowing the array to grow at will.

--> Should the database have a hash table? Can we make the indexes accessible like IDXNAMEFORSTRUCTNAME["a"] to return structname's instance?

What would be a complex query showing that query planning isn't necessary and would demonstrate the power of indexes + filters?

An example for finance: The struct DEALS records all negotiations which took place:

Deal 0: 03/21 09:16: 1 apple from Cindy to John for $1.82 each

Deal 1: 03/21 12:31: 5 peaches from John to Cindy for $2.19 each

Deal 2: 03/22 14:19: 3 apples from John to Samantha for $2.03 each

Deal 3: 03/22 16:47: 5 apples from Samantha to Cindy for $1.98 each

Questions:

  • List of persons who won/lost money on a given time interval
  • Last negociated prices for each asset
  • For a given asset and a given past day, did the prices went up or down on the next day?
  • Average unit price for all assets on a given day