Modelling of Graph Databases

Compared with traditional databases, e.g., relational ones, graph databases often lack some important database features. In particular, a graph database schema including integrity constraints is mostly not explicitly defined, and conceptual modelling is not used. It is hard to check the consistency of a graph database, because almost no integrity constraints are defined, or only very simple ones can be specified. In the paper, we discuss these issues and present current possibilities and challenges in graph database modelling. We also focus on integrity constraints modelling and propose functional dependencies between entity types, which are reminiscent of functional dependencies known from relational databases. We show a number of examples of often cited GDBMSs and their approaches to database schemas and ICs specification. A conceptual level of graph database design is also considered. We propose a sufficient conceptual model based on a binary variant of the E-R model and show its relationship to a graph database model, i.e. a mapping of conceptual schemas to database schemas. An alternative based on conceptual functions called attributes is presented.


Introduction
There are several application domains in which the data has a natural representation as a graph.
Well-known applications using graph data structures include particular fields of the Semantic Web, i.e. RDF data, Linked Data, and other graph-oriented data such as social networks and information networks. The influence of graph technologies is noticeable in areas like geography, spatial objects, and semi-structured data (e.g. XML).
Graph databases are useful for storing and processing sensor networks in logistics, protein interaction pathways in life sciences, etc. A graph database (GDB) is any storage system that uses graph structures to represent and store data. The associated new category of data stores is called graph database management systems (GDBMS). Today, due to some of their special properties, GDBMSs are counted among the so-called NoSQL databases.
Similarly to traditional databases, GDBs are based on a (logical) data model. Such a model is characterized by the following three features:
• Manipulation of data is expressed by graph transformations like graph-oriented operations and type constructors.
Most of the graph database models proposed in the literature ignore at least one of the three components of a complete database model. Particularly, there has been much less work devoted to the formalisms other than graph reachability patterns or, e.g., the ICs such as labels with unique names, typing constraints on nodes, functional dependencies, and domain and range of properties [1].
A graph database schema reflecting the above features consists of three components: a set of data structure types, a set of operators or inference rules, and a set of ICs (often called only constraints). In this context we often talk about the logical schema of a GDB. Formally, ICs are statements about a GDB which must be satisfied. In effect, a GDB can be considered an instance of its schema. For a GDB designer, the logical schema refers to the organization of data, which describes how the database will be constructed. Graph schemas are also appropriate tools for understanding and visualizing the data in the GDB.
Current commercial GDBs still need improvements to meet these traditional definitions.
A graph database model is usually not presented explicitly, but is hidden in the constructs of the data definition language (DDL) available in the given GDBMS. In particular, the IC possibilities and a declarative language for online querying of graph data are either limited or completely lacking. Also, the notion of database schema is often understood differently than usual. Many graph database vendors have opted either to support a weaker notion of schema or to avoid it entirely. For example, a Titan [21] database schema can be defined either explicitly or implicitly. The schema is defined implicitly when it is first used during the addition of an edge or node, or the setting of a property. GDBMS Orient [22] [...] [4] or Big Graph applications (e.g., [5]). In [6] we discuss limitations of graph databases, but without consideration of their conceptual properties including ICs. These parts of graph database technology are discussed in [7]. Some papers partially compare graph database models used in various commercial GDBMS (e.g., [3], [8]).

© 2017 Journal of Advanced Engineering and Computation (JAEC), Volume 1, Issue 1, June 2017.

Objective and contribution. In the paper, we discuss issues and current possibilities and challenges in graph database modelling. A conceptual level of graph database design is also considered. We propose a sufficient conceptual model and show its relationship to a graph database model.
We will also use a functional approach to database modelling in which a database graph is represented by so-called attributes, i.e. typed partial functions [10]. For this approach we use the HIT Database Model, see, e.g., [12], as a functional alternative to the E-R model.
A typed lambda calculus can then be used as a data manipulation language. This approach reflects the graph structure of a GDB and, on the other hand, provides powerful possibilities for dealing with properties when querying the GDB content. The paper is an extension of the work [7].
The rest of the paper is organized as follows. In Sec. [...]

Then, the definition of a database graph is as follows:

Definition 1: A database graph G = (V, E, N, Σ, ρ, A, Att) is a directed, labelled, attributed multigraph, where V is a finite set of nodes with identifiers drawn from an infinite alphabet N, E is a set of edges, and ρ is an incidence function mapping E to V × V. Node identifiers are also called labels (node labels). The edge labels are drawn from the finite set of symbols Σ, and α is an edge labelling function mapping E to Σ. A is a set of attributes (properties) represented by couples (A_i, value_i). Att is a mapping assigning to each node/edge a subset (possibly empty) of attributes from A.
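To make Definition 1 more tangible, the following is a minimal Python sketch of a database graph. It is an illustration only: the class name DatabaseGraph, its field and method names, and the sample data (node and edge identifiers, the Name and Since properties) are assumptions introduced here, and node and edge identifiers are assumed to be disjoint so that Att can be stored in one dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseGraph:
    """Directed, labelled, attributed multigraph in the spirit of Definition 1."""
    nodes: set = field(default_factory=set)         # V: node identifiers
    incidence: dict = field(default_factory=dict)   # rho: edge id -> (source, target)
    edge_label: dict = field(default_factory=dict)  # edge labelling: edge id -> symbol of Sigma
    att: dict = field(default_factory=dict)         # Att: node/edge id -> {attribute: value}

    def add_node(self, node_id, **attributes):
        self.nodes.add(node_id)
        self.att[node_id] = dict(attributes)

    def add_edge(self, edge_id, source, target, label, **attributes):
        # Edges carry their own identity, so parallel edges are allowed.
        if source not in self.nodes or target not in self.nodes:
            raise ValueError("both end nodes must exist")
        self.incidence[edge_id] = (source, target)
        self.edge_label[edge_id] = label
        self.att[edge_id] = dict(attributes)

    def edges_between(self, source, target):
        """Return the identifiers of all edges leading from source to target."""
        return sorted(e for e, ends in self.incidence.items()
                      if ends == (source, target))


# Hypothetical sample data: two parallel Teaches edges between the same nodes.
g = DatabaseGraph()
g.add_node("t1", Name="Alice")
g.add_node("l1", Name="English")
g.add_edge("e1", "t1", "l1", "Teaches", Since=2010)
g.add_edge("e2", "t1", "l1", "Teaches", Since=2015)
```

Because the incidence function is keyed by edge identifier rather than by the pair of end nodes, two edges with the same source, target, and label can coexist, which is what distinguishes a multigraph from a simple graph.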

Graph database modelling
In a native graph database model, both the schema and its instances are modelled as graphs.
Nodes and edges are first-class citizens. The model provides users with data-aware as well as graph-topology-aware data manipulation operators.

Example 1: Suppose entity types Language, Teacher, and Town. Relationship types Teaches and Is_born_in describe teaching and being born (in a town), respectively. An associated graph database schema is depicted in Fig. 2. Figure 1 shows an instance of this schema. We could suppose the following properties [...] The strings Language, Teacher, and Town as well as Teaches and Is_born_in can be used for labelling both in the graph database schema and in the associated GDB.
Because the human perceptual system is much more adept at working with graph data structures, a good visualization is indispensable for GDB processing [5]. The authors of [13] mention the Neoclipse editor of Neo4j, which enables visualizing and altering a GDB. Because a graph database schema is again a database graph, it seems possible to use such tools for graph database modelling.
One can observe that the graph database schema 1 may not be sufficient for applications using the GDB in Fig. 1.
Obviously, each teacher can teach several languages and each teacher is born in exactly one town. These ICs should already be revealed at a conceptual level. Thus, there is an M:N cardinality between languages and teachers, which is not expressed in Fig. 2. A more sophisticated description would therefore be needed. As one might expect, the answer lies in conceptual modelling and in a conceptual schema designed for the GDB (see Sec. 4).

Integrity Constraints
If a graph database schema exists, schema-instance consistency is required [14]. As in traditional databases, ICs provide a mechanism for capturing the semantics of the [...] Mostly the following ICs are studied [3]:
• type checking,
• node/edge identity, to verify that an entity or a relationship can be identified by either a value (e.g., name or ID) or the values of its attributes;
• referential integrity, to test that only existing entities are referenced;
• cardinality checking, to verify uniqueness of properties or relationships;
• functional dependencies, to test that an element in the graph determines the value of another;
• graph pattern constraints, to verify a structural restriction (e.g., path constraints).
A natural and useful IC is a functional dependency (FD) on graph nodes and edges. For example, Yu and Hein [16] proposed a value-clustered graph functional dependency for RDF data. Compared with FDs known from the relational data model (see, e.g., [17]), FDs in a GDB require a special approach. An oriented edge in the graph database schema does not necessarily denote an FD: Teaches in Fig. 2 does not, whereas Is_born_in does. This means that FD specification has to be conceived as a formulation of explicit ICs on the database level.
Because graph database schemas are multigraphs, an FD description needs node labels, an edge label, and a direction, e.g., Teacher → Is_born_in Town denotes such an FD.
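A check of an FD such as Teacher → Is_born_in Town against a GDB instance can be sketched as follows. This is a minimal illustration under stated assumptions, not a method prescribed by the paper: the function name fd_violations, the representation of edges as (source, label, target) triples, and all sample identifiers are hypothetical. The sketch tests only the "at most one target" part of the dependency; requiring that each teacher has a town at all would be a separate (mandatory participation) constraint.

```python
from collections import defaultdict


def fd_violations(node_type, edges, source_type, edge_label, target_type):
    """Return source nodes that reach more than one distinct target
    of target_type via edges labelled edge_label."""
    targets = defaultdict(set)
    for source, label, target in edges:
        if (label == edge_label
                and node_type.get(source) == source_type
                and node_type.get(target) == target_type):
            targets[source].add(target)
    return sorted(s for s, ts in targets.items() if len(ts) > 1)


# Hypothetical instance of the schema from Example 1.
node_type = {
    "t1": "Teacher", "t2": "Teacher",
    "prague": "Town", "brno": "Town",
    "en": "Language", "de": "Language",
}
edges = [
    ("t1", "Is_born_in", "prague"),
    ("t2", "Is_born_in", "brno"),
    ("t1", "Teaches", "en"),
    ("t1", "Teaches", "de"),
]

born_violations = fd_violations(node_type, edges, "Teacher", "Is_born_in", "Town")
teaches_violations = fd_violations(node_type, edges, "Teacher", "Teaches", "Language")
```

On this sample data the Is_born_in dependency holds (no violating nodes), while treating Teaches as an FD would flag t1, which matches the observation that not every oriented edge denotes an FD.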
Often, FDs can be found for some edges coming from non-functional relationships, e.g., Teaches. Suppose that in the associated domain there is a rule that teachers older than 70 teach at most one language. Such a dependency can be specified, e.g., as Teacher(Birth_year > 1994) → Teaches Language. We call such FDs conditional functional dependencies. Generally, they are de- [...] In practice, an important problem is that GDBs might be inconsistent, i.e., the database might fail to satisfy all ICs.
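Returning to conditional functional dependencies such as Teacher(Birth_year > 1994) → Teaches Language, one possible way of checking them is to pass the condition as a predicate over node properties. The sketch below is an assumption-laden illustration: the function name, the node representation as (type, properties) pairs, and the sample data are hypothetical, and the predicate simply reuses the expression exactly as it is written in the text.

```python
def conditional_fd_violations(nodes, edges, source_type, condition,
                              edge_label, target_type):
    """Return nodes of source_type that satisfy the condition and still
    reach more than one distinct target of target_type via edge_label."""
    offenders = []
    for node_id in sorted(nodes):
        ntype, props = nodes[node_id]
        if ntype != source_type or not condition(props):
            continue  # the dependency only constrains nodes meeting the condition
        targets = {t for s, label, t in edges
                   if s == node_id and label == edge_label
                   and nodes[t][0] == target_type}
        if len(targets) > 1:
            offenders.append(node_id)
    return offenders


# Hypothetical data: node id -> (type, properties).
nodes = {
    "t1": ("Teacher", {"Birth_year": 1996}),
    "t2": ("Teacher", {"Birth_year": 1980}),
    "en": ("Language", {}),
    "de": ("Language", {}),
}
edges = [
    ("t1", "Teaches", "en"), ("t1", "Teaches", "de"),
    ("t2", "Teaches", "en"), ("t2", "Teaches", "de"),
]


def born_after_1994(props):
    # Condition copied from the expression in the text.
    return "Birth_year" in props and props["Birth_year"] > 1994


offenders = conditional_fd_violations(
    nodes, edges, "Teacher", born_after_1994, "Teaches", "Language")
```

Both teachers teach two languages, but only t1 satisfies the condition, so only t1 is reported. An unconditional FD is the special case in which the predicate always returns True.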
In the case of GDB applications, such inconsistencies appear due to interoperability and graph distribution. [...] setReadonly(), Not Null: setNotNull(), and Unique.
The role of a graph database schema can be precisely specified in OrientDB:
• schema-full: enables strict mode at a class level and sets all fields as mandatory.
• schema-less: enables classes with no properties. The default is non-strict mode, meaning that records can have arbitrary fields.
• schema-hybrid: enables classes with some fields, but allows records to define custom fields. This role is also sometimes called schema-mixed.

[...] Fig. 3(a.). For this purpose, the min-max ICs are usable. In this model variant, min-max cardinalities are expressed using the crow's foot notation at the start node and the end node of some edges (see Fig. 3(b.)). A straight and a dotted line express a mandatory and an optional relationship, respectively. Min-max ICs could be expressed equivalently by expressions (E1 : (a, b), E2 : (c, d)), where a, c ∈ {0, 1}, b, d ∈ {1, N}, and N means any number greater than 1. Weak entity types are identification- and existence-dependent on some other entity type.
Suppose a weak entity type E_W with a partial identification key that distinguishes instances of E_W that are related to the same instance of a strong entity type E. The full identification key of E_W then has to include the identification key of E. In Fig. 4 [...] Fig. 6). Consequently, their full identification key will be #Husband_ID, #Wife_ID, #Date, where a referential integrity exists in Loan_app, i.e. #Husband_ID ⊆ #Person_ID and, similarly, #Wife_ID ⊆ #Person_ID. On the other hand, the associated graph database schema in Fig. 7 will be simpler, due to the one-way orientation and union of partial keys. One could ask why only one edge label is used in the graph database schema 3. Obviously, two edges will lead from each Loan_app node in a GDB instance. This should be ensured [...]

Definition 2: A graph conceptual schema in the binary E-R model is a 4-tuple <E, R, H, CC>, where E is a set of entity types, each of them given by its name E_i and a set of attributes A_Ei. One or more attributes from A_Ei determine the identification key K_Ei of E_i. R is a set of binary relationship types, where each relationship type R is given by a couple (E_i1, E_i2) and a set of attributes A_R. There are two inverse relationship names for each relationship type. If E_i1 = E_i2 for R, then such a relationship type is called recursive. H is a set of ISA-hierarchies of entity types, and CC is a set of ICs.
There is a set E_W ⊂ E (possibly empty) of weak entity types. For each weak entity type E_W there is at least one sequence E_1, ..., E_s such that E_1 = E_W, E_(i−1) is identification-dependent on E_i, i = 2, ..., s − 1, and E_s is a strong entity type. The identification key of E_W is the union of all partial and complete identification keys from this sequence. In each ISA-hierarchy [...]
• entity type E is the source of H_E with identification key K_E,
• the graph associated with H_E is a tree with the root E,
• there is no hierarchy H_E' ∈ H such that the tree associated with H_E' is a subtree of the tree associated with hierarchy H_E, except for the case when H_E' has only a root.
For each relationship type R ∈ R there are two min-max ICs in CC and, vice versa, for each min-max IC from CC there is at least one relationship type R in R having this IC as its min-max IC.
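The two min-max ICs attached to a relationship type can be checked against an instance roughly as follows. This is a sketch under an explicit assumption: the pair (a, b) attached to an entity type is read as lower and upper bounds on the number of relationship instances in which each entity of that type participates. The function name, the pair-based encoding of relationship instances, and the sample data are hypothetical.

```python
from collections import Counter

N = float("inf")  # stands for "any number greater than 1" (no upper bound)


def participation_violations(entities, pairs, position, bounds):
    """Return entities whose number of occurrences at the given position
    of the relationship pairs lies outside the (min, max) bounds."""
    low, high = bounds
    counts = Counter(pair[position] for pair in pairs)
    return sorted(e for e in entities if not (low <= counts[e] <= high))


# Hypothetical instance of Is_born_in with ICs (Teacher : (1, 1), Town : (0, N)).
teachers = {"t1", "t2", "t3"}
towns = {"prague", "brno"}
is_born_in = [("t1", "prague"), ("t2", "prague"), ("t2", "brno")]

teacher_violations = participation_violations(teachers, is_born_in, 0, (1, 1))
town_violations = participation_violations(towns, is_born_in, 1, (0, N))
```

Here t2 violates the upper bound (two towns of birth) and t3 violates the lower bound (no town of birth), while the (0, N) bound on Town can never be violated.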
Conceptually, other generic relationship types, e.g., is-part-of relationships, could be considered in the binary E-R model. They can be described simply with graph conceptual constructs as well.

Mapping conceptual schemas to database schemas
A correct graph conceptual schema may be mapped into an equivalent (or nearly equivalent) graph database schema with a straightforward mapping algorithm, but with a weaker notion of a database schema, i.e. some inherent ICs from the conceptual level will be neglected to satisfy the usual notation of directed, labelled, attributed multigraphs. The mapping algorithm transforming a graph conceptual schema C into a graph database schema D can be described by the following rules: [...]

[...] Fig. 8. We can observe that the pattern is a generalization of a conditional functional dependency. A significant problem is how to use these patterns in practice, bearing in mind that the problem of graph matching using subgraph isomorphism is known to be NP-complete.

Functional approach to graph conceptual modelling
Conceptual modelling can be based on the notion of attribute viewed as an empirical typed function that is described by an expression of a natural language [12]. Many papers are devoted to this approach, which was studied mainly in the 1990s (see, e.g., [20]).

Types
A hierarchy of types is constructed as follows.
We assume the existence of some (elementary) types S_1, ..., S_k (k ≥ 1). They constitute a base B. More complex types are constructed in the following way.
The set of types T over B is the least set containing types from B and those given by (i)-(ii). When S_i in B are interpreted as non-empty sets, then (S : R_1, ..., R_n) denotes the set of all (total or partial) functions from R_1 × ... × R_n into S, and (R_1, ..., R_n) denotes the Cartesian product R_1 × ... × R_n.
The elementary type Bool = {TRUE, FALSE} is also in B. The type Bool allows some objects to be typed as sets and relations. They are modelled as unary and n-ary characteristic functions, respectively. The notion of a set is then redundant here.
The fact that X is an object of type R ∈ T can be written as X/R, or "X is an R-object". [...] Then, the associated typed lambda calculus with applications of functions and lambda abstractions provides a powerful tool for querying graph data conceived as functions [10].
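The idea of conceiving graph data as functions and querying them by application and abstraction can be imitated in Python, as in the hedged sketch below. It is only an analogy: Python lambdas are untyped, so the typing discipline of the calculus is not enforced, and the sample data as well as the names is_born_in, teaches, and query are assumptions reusing the entity and relationship names of Example 1. A partial function returns None where it is undefined, and a set of languages is represented by its characteristic function into Bool.

```python
# Hypothetical data behind the attributes.
teachers = ["t1", "t2"]
_birth_town = {"t1": "Prague", "t2": "Brno"}
_taught = {"t1": {"English", "German"}, "t2": {"English"}}

# Attribute of type (Town : Teacher): a partial function from teachers to towns.
is_born_in = _birth_town.get


def teaches(teacher):
    """Attribute assigning to a teacher a set of languages, given as a
    characteristic function of type (Bool : Language)."""
    languages = _taught.get(teacher, set())
    return lambda language: language in languages


# Query built from function application and lambda abstraction:
# teachers born in a given town who teach a given language.
query = lambda town, language: [
    t for t in teachers
    if is_born_in(t) == town and teaches(t)(language)
]

result = query("Prague", "English")
```

The expression teaches(t)(language) is a double application: the attribute is first applied to a teacher, and the resulting characteristic function is then applied to a language.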

Conclusion
In this paper, we proposed an approach to modelling GDBs based on a classical technique, here a binary variant of the E-R model, known from the world of relational DBMSs. We also proposed a rather non-traditional functional approach to modelling graph data based on the notion of attribute. Attributes are conceptual objects whose extensions make it possible to conceive a property graph as a set of functions.
We used the notions of graph conceptual model and graph database model. We also discussed relationships between schemas in both models, particularly the transformation of a graph conceptual schema to a graph database schema. Compared with similar approaches in the world of relational DBMSs, the resulting schema is not given uniquely in this approach, both in terms of graph structure and ICs. We also discussed some types of ICs reminiscent of functional dependencies known from relational theory. Both graph data modelling and IC formulation are still maturing and offer an interesting theme for future research.