I recently came across an article on InfoQ explaining some architectural decisions made when building Reddit. While most of what was mentioned seemed pretty much in line with what you’d expect from the architecture of a high-traffic site, one thing stood out the most – Open Schema database model, also known as Entity-Attribute-Value model (EAV).
The idea behind this model is quite simple. In the case of Reddit, there are only 2 tables encapsulating everything – Entity (Thing) and Data. Entity is the table containing generic information (id, name, create date) for things such as users, comments and etc. Where as Data contains a foreign key referencing an entity along with the name of the attribute for that entity and a value.
Two of the strongest benefits of this model, as it is mentioned by the co-founder of Reddit, is that it’s very flexible to change and fast. There is no need to add tables or columns in order to extend functionality of the application and that could be a very daunting task, especially if you’re dealing with large data sets.
I would probably consider this a very extreme case of EAV. After all, this pattern sits comfortably with the standard relational model, especially if you’d like to allow your users to extend parts of your system.
However, I have never seen a full-blown implementation of EAV. One of the obvious drawbacks of this pattern is the result of its greatest benefit. Since all stored data is so generic, how can anything be queried? We no longer have column names and everything is represented through strings, at which point there is no data integrity.
Essentially, one would have to provide adequate data integrity and maintain consistency at the application level. The way i see it, a very painful experience, but can be accomplished.
Another issue which springs to mind is reporting. A complete EAV model has to be the worst thing to analyze and report on due to the severe lack of data integrity. No, actually it’s impossible.
I’d be very interested to see how Reddit resolved this problem. The only thing I can think of is transforming data into a normalized relational model and run queries on it. More work but for a good cause I suppose – performance.
Anyway, it’s an interesting decision and probably a worthwhile compromise, as it appears to have worked for Reddit. Nonetheless, I’m still on the fence about it. I’m no database architect, but all indicators indicate to me that this is not a relational design implemented on an RDBMS.
I don’t want to blindly call it an anti-pattern, as it is an obvious and perhaps the only solution under certain circumstances, but maybe its home is on a NoSQL platform.