2021-01-19

Data modelling

This is a draft paper about a lightweight data modelling framework that I am developing as a home project, for use in other home projects of mine. It is incomplete; I will be amending it as I find time to write more and as my understanding of what this framework is supposed to do evolves.


Introduction


Every software project deals in one way or another with data. Some projects have small amounts of data, some have large amounts, and some even have "big" data. The data almost always exhibit a certain well-defined structure, known as the Schema, and the loosely defined term Data Model is used to refer either to the data, or to the schema, or non-specifically to both.

In virtually all cases, the data model is highly application-specific, but many characteristics and operations are common or even ubiquitous across data models.

  • Examples of common characteristics: 
    • Decomposition of the data into clearly defined entities
    • Decomposition of entities into primitive types
    • Identifiability of entities
    • Relations among entities:
      • Referential
      • Inheritance
    • Tabular and/or hierarchical organization
    • Naming of entities and their members
    • Referential integrity (self-containment)
    • Value constraints
    • A schema describing the above.

  • Examples of commonly used operations:
    • Serialization (to and from markup, e.g. JSON, XML, etc.)
    • Persistence (onto a Relational Database)
    • Querying
    • Filtering (obtaining a data model which is a view into a subset of another)
    • Duplication
    • Comparison
    • Set operations
      • Union
      • Intersection
      • Difference
    • Validation
    • Observability (change notifications)
    • Mirroring (maintaining two copies in sync)
    • Remoting (mirroring over a network)
    • Generic (textual) viewing
    • Generic (textual) editing

The above lists obviously contain many terms that come from relational databases, and this is no coincidence, since a lot of the theoretical work around data models has already been done in the context of relational database theory. However, the notion of data models is broader than databases, so the above lists also contain terms that are not normally found in databases, such as inheritance, observability, and serialization. In fact, a data model is not necessarily bound to secondary storage, it does not need a query language such as SQL, and it does not require something as heavyweight as a relational database system to manage it. The simplest and most lightweight form of a data model is nothing more than a graph of objects in memory, in which case its schema is just the class graph of the objects.
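
To make this concrete, here is a minimal sketch of such an in-memory data model (the class names and data are mine, purely for illustration, and not part of any framework): the classes constitute the schema, the objects constitute the data, and a "query" is nothing more than a traversal of the object graph.

    import java.util.List;

    // A minimal in-memory data model: the classes are the schema, the objects are the data.
    // (Illustrative sketch; the names are not part of any framework.)
    public class InMemoryDataModel
    {
        static final class Author
        {
            final String name;
            Author(String name) { this.name = name; }
        }

        static final class Book
        {
            final String isbn;    // identifiability: serves as the key of the entity
            final String title;
            final Author author;  // referential relation to another entity
            Book(String isbn, String title, Author author)
            {
                this.isbn = isbn;
                this.title = title;
                this.author = author;
            }
        }

        public static void main(String[] args)
        {
            Author knuth = new Author("Donald Knuth");
            List<Book> books = List.of(
                new Book("isbn-0001", "The Art of Computer Programming, Vol. 1", knuth),
                new Book("isbn-0002", "The Art of Computer Programming, Vol. 2", knuth));

            // A trivial "query": find all books by a given author.
            books.stream()
                .filter(book -> book.author == knuth)
                .forEach(book -> System.out.println(book.title));
        }
    }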

Definition:

A data model is a finite set of structured and interrelated data.

In order to avoid confusion I will be referring to heavyweight data models of the traditional kind as "Estate Data Models" in the sense that they tend to represent the main data estate of an enterprise.

Programmers often build and/or use class graphs without realizing that they are in fact data models. Data models also lurk in less obvious places: in a file system, the data model consists of directory entries, users, and groups, and a directory listing utility can be thought of as performing a query on the data model of the file system to obtain a subset of it and display it. However, the creators of file systems and of directory listing utilities rarely view their creations this way.
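
As a highly simplified, purely illustrative sketch of this (the names are of my own choosing), a directory listing really is just a query over a set of entities:

    import java.util.List;

    // Highly simplified, purely illustrative sketch of a file system as a data model.
    public class FileSystemAsDataModel
    {
        record User(String name) {}

        // A directory entry refers to its parent (referential relation) and to its owner.
        record DirectoryEntry(String name, DirectoryEntry parent, User owner, boolean isDirectory) {}

        // A "directory listing" is nothing but a query: select all entries whose parent is the given directory.
        static List<DirectoryEntry> list(List<DirectoryEntry> allEntries, DirectoryEntry directory)
        {
            return allEntries.stream()
                .filter(entry -> entry.parent() == directory)
                .toList();
        }

        public static void main(String[] args)
        {
            User root = new User("root");
            DirectoryEntry home = new DirectoryEntry("home", null, root, true);
            DirectoryEntry alice = new DirectoryEntry("alice", home, root, true);
            DirectoryEntry notes = new DirectoryEntry("notes.txt", alice, root, false);

            for (DirectoryEntry entry : list(List.of(home, alice, notes), alice))
                System.out.println(entry.name());
        }
    }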


The Problem


Data models built using only the features of the programming language at hand are ad-hoc, so their structure and intended use are obvious only to the programmers who wrote them. Contrast this with structured approaches, such as those found in relational databases, where it is obvious to every programmer who looks at the schema what the entities are, how they relate to each other, how to formulate a query, and what results to expect when that query is executed. Still, relational database implementations tend to be heavyweight, and interfacing with them requires some bureaucracy, so unless a database is absolutely necessary for a project, nobody adds one just to keep their data model neat and easily understandable.

In many projects that I have worked on, I have encountered problems that arise when the data model is expressed as an ad-hoc class graph, and I have come to see potential benefits that could be realized if data models were represented using some lightweight, general purpose data modelling framework.

  • In ad-hoc class graphs, the structure and content of the data model tends to be tied to the programming language at hand; a general purpose data modelling framework would introduce and enforce certain language-agnostic conventions, thus allowing data models to be shared across languages.
  • In ad-hoc class graphs, various characteristics of the data model are implemented in non-standard, application-specific ways; as a result, even though we are often expressing well known concepts, it is not clear by looking at the code that a well known concept is being expressed. A general purpose data modelling framework would enforce the use of certain patterns, thus making each characteristic readily recognizable.
  • In ad-hoc class graphs, many data model operations are implemented in application-specific code, thus essentially re-inventing wheels; in a general purpose data modelling framework such operations would:
    • be free from application-specific concerns (i.e. operate in an abstract context), thus being much easier to write and to understand;
    • be implemented once and for all, or
    • receive different implementations that would compete to see which one is better.


Existing solutions


Unfortunately, I do not know of any lightweight, general purpose framework for programmatically expressing data models in a standardized way; instead, there exist various frameworks which operate on ad-hoc class graphs, each of them performing a specific narrow set of data model operations, and realizing only those characteristics of data models that are pertinent to those operations. For example:

  • Object-Relational Mapping (ORM) frameworks offer persistence and querying for arbitrary class graphs.
  • Serialization frameworks offer conversion of arbitrary class graphs to and from markup languages.
  • Some GUI frameworks offer mapping between arbitrary class graphs and GUI controls.

Other frameworks do various other things, each one of them within its own narrow context.

Each of these frameworks employs reflection to make sense of the class graph, and the reflection must usually be aided by extra notation (known as attributes in C#, annotations in Java) that the programmer has to add to classes and their members, to indicate how various ad-hoc programming constructs correspond to well known concepts such as keys and relationships. So, if on a certain class graph we want to have both persistence and serialization, we have to add a separate set of extra notation for each framework, resulting in considerable code bloat; furthermore, each framework will be performing its own reflection, which translates into unnecessary overhead.
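
As an illustration of this duplication, consider the following hypothetical class, which needs both persistence (here via JPA annotations) and serialization (here via Jackson annotations); each framework demands its own notation for stating essentially the same facts about the same members:

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.ManyToOne;

    import com.fasterxml.jackson.annotation.JsonIgnore;
    import com.fasterxml.jackson.annotation.JsonProperty;

    // Hypothetical class that must be both persisted (JPA) and serialized (Jackson).
    // Each framework requires its own annotations, and each will perform its own
    // reflection over the class at runtime.
    @Entity
    public class Customer
    {
        @Id                                     // JPA: this member is the key of the entity
        @JsonProperty("id")                     // Jackson: this is how the member appears in JSON
        private long id;

        @Column(nullable = false, length = 100) // JPA: value constraints
        @JsonProperty("name")                   // Jackson: naming, stated all over again
        private String name;

        @ManyToOne                              // JPA: referential relation to another entity
        @JsonIgnore                             // Jackson: skip the relation when serializing
        private Customer referredBy;

        protected Customer() {}                 // JPA requires a no-arg constructor
    }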


The Ubiquity of Data Models


Since I identified the need for a universal lightweight data model framework, I have been paying special attention to the concept of data models, and I have realized that they are everywhere, even in places where I could not see them before. Software projects quite often utilize not just one but multiple different data models, serving different purposes.

For example, in an MVVM application, the set of viewmodels can be thought of as a data model in its own right, separate from the estate data model that the first letter of the acronym stands for. Of course, in the case of viewmodels, certain characteristics and operations are inapplicable, e.g. we are unlikely to persist them onto a relational database, but many others are obviously still applicable (see the sketch after the list below):

  • Naming of view model types and their members
  • Value constraints and validation
  • Change notifications
  • An inheritance hierarchy
  • An organization which is mostly hierarchical but also in many cases tabular
  • etc.
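
For instance, a bare-bones viewmodel might look as follows (a sketch in Java, using java.beans.PropertyChangeSupport for change notifications; the class names are illustrative); it exhibits naming, an inheritance hierarchy, a value constraint, and observability:

    import java.beans.PropertyChangeListener;
    import java.beans.PropertyChangeSupport;

    // Illustrative sketch of viewmodels exhibiting data model characteristics:
    // naming, an inheritance hierarchy, value constraints, and change notifications.
    abstract class ViewModel
    {
        protected final PropertyChangeSupport changeSupport = new PropertyChangeSupport(this);

        public void addPropertyChangeListener(PropertyChangeListener listener)
        {
            changeSupport.addPropertyChangeListener(listener);
        }
    }

    class PersonViewModel extends ViewModel // member of an inheritance hierarchy
    {
        private String name = "";

        public String getName() { return name; }

        public void setName(String newName)
        {
            if (newName == null || newName.isBlank()) // value constraint / validation
                throw new IllegalArgumentException("name must not be blank");
            String oldName = name;
            name = newName;
            changeSupport.firePropertyChange("name", oldName, newName); // change notification (observability)
        }
    }

A view, or any other observer, can then register a PropertyChangeListener on a PersonViewModel and be notified whenever the name changes.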

Furthermore, in that same MVVM application (or in any GUI application, for that matter) the set of views also constitutes a separate data model. We do not normally think of GUI controls as entities of a data model, but nothing says that we cannot. If we try for a moment to view them like that, we notice that they fit the bill (see the sketch after the list below):

  • The set of available GUI element types constitutes a schema.
  • A group of instantiated GUI elements is a finite set of interrelated data.
  • Element types and their properties are named.
  • The properties of each GUI element are entity fields.
  • GUI elements certainly have an inheritance hierarchy.
  • The parent-child relationship between GUI elements is in fact a referential relationship.
  • Notifications are issued when various things happen.
  • Serialization of the data model gives us a description of our GUI in markup; deserialization can allow us to reconstruct our GUI from markup.
  • If we view the data model in a generic textual way, then we have what Visual Studio calls the "Live Visual Tree".
  • etc.
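
As a small sketch of the last few points (using Java Swing purely for illustration), generic textual viewing of this data model amounts to walking the parent-child (referential) relationship of the GUI elements and printing the resulting tree, which is essentially what a "Live Visual Tree" does:

    import java.awt.Component;
    import java.awt.Container;
    import javax.swing.JButton;
    import javax.swing.JFrame;
    import javax.swing.JLabel;
    import javax.swing.JPanel;

    // Sketch, using Java Swing purely for illustration: generic textual viewing of the
    // "data model" formed by GUI elements, by walking the parent-child (referential)
    // relationship and printing the tree.
    public class GuiTreeDump
    {
        static void dump(Component component, int depth)
        {
            System.out.println("  ".repeat(depth) + component.getClass().getSimpleName());
            if (component instanceof Container container)
                for (Component child : container.getComponents())
                    dump(child, depth + 1);
        }

        public static void main(String[] args)
        {
            JFrame frame = new JFrame("demo");
            JPanel panel = new JPanel();
            panel.add(new JLabel("Name:"));
            panel.add(new JButton("OK"));
            frame.setContentPane(panel);
            dump(frame, 0); // prints JFrame, JRootPane, ..., JPanel, JLabel, JButton
        }
    }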

Data Models as a Universal API


The modern understanding of how to achieve interoperability across different applications, or across highly isolated modules within a single application, requires the meticulous specification of intricate custom APIs, and the laborious writing of complex custom code on either side of each API: to implement it on one side, and to consume it on the other.

However, more often than not, all that this code does on either side of the API is to manipulate some custom data model:
  • The implementing side of an API usually serves each call by doing nothing but looking up information in a data model or updating the data model with new information. 
  • The consuming side of an API usually places API calls for no reason other than to construct its own subset-copy of the implementing side's data model, in its own terms.
This situation is so ubiquitous that it is hard to realize its full extent: even device control APIs, which are thought of as command-oriented APIs par excellence, offer operations intended to place a device in various states, and usually also have to be able to report the actual state of the device; so they too can be thought of as doing nothing but manipulating and querying a data model, the data model in this case being the full device state.
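
To make this concrete, here is a hypothetical sketch (all names are mine, chosen for illustration) of a device control API recast as nothing but queries and updates on a data model of the device state:

    // Hypothetical sketch: a command-oriented device control API is equivalent to
    // querying and updating a data model of the full device state.
    public class DeviceStateAsDataModel
    {
        // The data model: the full state of an imaginary thermostat.
        static class ThermostatState
        {
            double targetTemperature;
            double measuredTemperature;
            boolean heating;
        }

        // The "API": each operation merely reads or updates the data model.
        static class ThermostatApi
        {
            private final ThermostatState state = new ThermostatState();

            void setTargetTemperature(double celsius) { state.targetTemperature = celsius; } // update
            double getMeasuredTemperature() { return state.measuredTemperature; }           // query
            boolean isHeating() { return state.heating; }                                    // query
        }
    }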

Therefore, a large number of the APIs that have been defined, and are continuously being defined every day all over the world, could be replaced with data model schemas, while all the code that is currently being written to implement and to consume them could be either eliminated or replaced with transformations between public (shared) data models and internal data models. Given a sufficiently rich data model framework, these transformations can be mostly declarative, rarely requiring custom code to be written, while the framework can take care of all the communication necessary in order to apply these transformations on either side, thus sparing us from having to deal with the actual nature of the communication, whether it consists of simple invocations within a common address space or of REST requests and responses exchanged between distant hosts on a network.

