Google Cloud Datastore

Google Cloud Datastore is a fully managed NoSQL database service offered as part of Google Cloud Platform. It is designed to store large volumes of structured, document-style data (similar to JSON) and to provide a reliable, efficient platform for building scalable applications. Unlike traditional relational databases, Datastore is schema-less, which allows flexible data modeling and lets applications change their data model without downtime.

Google Cloud Datastore is used for data handling in mobile apps, web applications, and IoT systems because of its automatic scaling, strong consistency, and integration with other Google Cloud services. It is built for applications that require high scalability, low-latency reads and writes, and automatic management of data across distributed systems.

Datastore organizes data into entities and properties, and entities are grouped into kinds. Kinds are similar to tables in relational databases but impose no schema constraints. Each entity is uniquely identified by a key, which can carry either a user-defined identifier or one generated automatically by the system.

Google Cloud Datastore offers an API and client libraries for general-purpose programming languages such as Python, Java, and Node.js. The client libraries support multiple release versions of these languages, so Cloud Datastore can be integrated with both legacy and modern applications. It also supports asynchronous operations, which lets developers build non-blocking, highly responsive systems. With respect to data consistency, Google Cloud Datastore provides strong consistency for single-entity lookups and eventual consistency for queries across multiple entities.

History

Google Cloud Datastore was announced on April 11, 2013, as a fully managed NoSQL document database designed to support large-scale web and mobile applications. It was based on the original Datastore used in Google App Engine since 2008, but was designed to offer features such as scalability, higher availability, and automatic data replication across multiple data centers.

Before the launch of Cloud Datastore, developers on Google App Engine worked with a built-in Datastore that was available only to App Engine apps. As Google Cloud Platform grew, developers wanted a database that they could use outside App Engine, with more flexibility and wider availability. Cloud Datastore met this need by adding features like automatic sharding, indexing, and support for eventual consistency.

Google launched Cloud Firestore in 2018 as a new NoSQL database with features such as real-time updates, offline support, and faster query execution. It was intended to replace Cloud Datastore, and new users were encouraged to use Firestore instead. However, because many developers were still building applications on Datastore, Google renamed it "Firestore in Datastore mode" in 2020. Existing users could thus keep using the familiar Datastore features, with the option to upgrade to Firestore later.

Even after the rise of Firestore, Cloud Datastore is still widely used in legacy applications, especially those apps that need a managed NoSQL database with strong multi-region replication and automatic scaling. Google Cloud continues to support these legacy systems so that they remain reliable and fully functional even in the future.

Overview

Access and management

Users can access a Google Cloud Datastore database through the Google Cloud Console, the gcloud command-line tool, or client libraries for various programming languages. Depending on their needs, they can work through the graphical interface or write code to interact with the database.

Data organization

Data in Google Cloud Datastore is organized into entities, which are comparable to individual records. Entities are grouped into kinds, which play a role similar to tables in a traditional database. Unlike rows in a relational table, however, entities of the same kind do not have to follow a fixed structure such as a predefined schema, and they can have different sets of properties.

Entities and properties

Each entity is a structured set of properties, and each property is a key-value pair. Property values can be strings, numbers, booleans, timestamps, arrays, or geographic points. This flexibility allows developers to model complex data structures without a rigid schema.

Entity keys

Every entity in Datastore is uniquely identified by a key. A key includes:

A project ID (the identifier of the Google Cloud project),
An optional namespace (used for multi-tenancy),
A kind (defining the entity type),
And a name or numeric ID that distinguishes the entity within its kind.

Entities can optionally be grouped into entity groups to allow for transactional updates involving multiple entities.
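The key structure described above can be sketched as a small data class. This is a toy model for illustration only: the class name EntityKey, its field names, and the sample project "demo-project" are assumptions made for this example and are not the client library's own key type.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class EntityKey:
    """Toy model of a Datastore key (hypothetical, not the library's class)."""
    project: str
    kind: str
    id_or_name: Union[int, str]
    namespace: Optional[str] = None       # optional, used for multi-tenancy
    parent: Optional["EntityKey"] = None  # optional ancestor key

    def path(self) -> Tuple[Union[int, str], ...]:
        """Flatten the ancestor chain into (kind, id, kind, id, ...)."""
        prefix = self.parent.path() if self.parent else ()
        return prefix + (self.kind, self.id_or_name)

    def root(self) -> "EntityKey":
        """Return the top-most ancestor, which identifies the entity group."""
        return self.parent.root() if self.parent else self


customer = EntityKey("demo-project", "Customer", "C123456")
order = EntityKey("demo-project", "Order", 1001, parent=customer)

print(order.path())              # ('Customer', 'C123456', 'Order', 1001)
print(order.root() == customer)  # True
```

In this model, two entities belong to the same entity group when their keys share the same root, which is what makes a transactional update across them possible.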

Data types

Google Cloud Datastore supports a range of property data types as shown below:

String
Integer
Float
Boolean
Timestamp
Array (List)
Embedded Entity
GeoPoint (geographical coordinates)
Binary data (Blob)

This variety makes it possible to model both structured and semi-structured data.
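As a rough illustration, the listed property types line up with ordinary Python values as follows. The "Task" entity and its field names are hypothetical, and the entity is written as a plain dictionary. The embedded entity and the geographic point are approximated here by a dict and a (latitude, longitude) tuple, which is an assumption of this sketch rather than how a client library represents them.

```python
from datetime import datetime, timezone

# Hypothetical "Task" entity as a plain dict; each comment names the
# Datastore property type that the value stands in for.
task = {
    "description": "Write report",                          # String
    "priority": 4,                                          # Integer
    "percent_complete": 62.5,                               # Float
    "done": False,                                          # Boolean
    "created": datetime(2024, 1, 15, tzinfo=timezone.utc),  # Timestamp
    "tags": ["work", "urgent"],                             # Array (List)
    "assignee": {"name": "Ada", "team": "Docs"},            # Embedded Entity
    "location": (37.422, -122.084),                         # GeoPoint
    "attachment": b"\x89PNG",                               # Binary data (Blob)
}

for name, value in task.items():
    print(f"{name}: {type(value).__name__}")
```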

Querying and indexes

Datastore automatically indexes each property to enable efficient querying. For more complex queries involving multiple properties, composite indexes can be defined manually. Index management is typically handled through an index.yaml file or through the console.

GQL

GQL (Google Cloud Datastore Query Language) is a SQL-like query language for Google Cloud Datastore. It lets users query the Datastore service with statements that resemble SQL but are tailored to the NoSQL nature of the platform. GQL provides ways to filter, order, and otherwise operate on Datastore entities without writing the equivalent queries against the underlying Datastore APIs.

Unlike SQL, GQL is limited in the joins and relationships it can express. It does, however, support filtering on properties by equality, inequality, and range, and users can combine multiple conditions in a single query. This makes GQL suitable for a wide range of read-oriented use cases, such as retrieving user data or browsing product catalogs.

GQL also supports ancestor queries, which let users retrieve related entities based on their place in a hierarchy. This is useful for applications that manage hierarchical data, such as content management systems or data models with parent-child relationships. Although GQL simplifies querying, it operates within the constraints of Datastore's eventual consistency model.

Example GQL Query:

SELECT * FROM Task WHERE status = 'completed' AND priority = 'high' ORDER BY created DESC

This query retrieves all Task entities with a status of "completed" and a priority of "high", ordered by the created timestamp in descending order.

Although GQL offers an easy-to-use interface for querying Google Cloud Datastore, more complex queries and joins must be handled through Datastore's native APIs, such as the Google Cloud Datastore API, which give developers greater flexibility and control. For example, the following join operation is not possible:

SELECT * FROM Customer JOIN Order ON Customer.customer_id = Order.customer_id

The same result can be obtained in application code with a two-step process, as in the code below. First, use the Datastore API to fetch the customer by ID. Then, use that customer ID in a second query to retrieve the related orders.

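from google.cloud import datastore

# Uses the default project and credentials from the environment.
client = datastore.Client()

# Step 1: look up the customer directly by key.
customer_key = client.key('Customer', 'C123456')
customer = client.get(customer_key)  # None if no such entity exists

# Step 2: query the orders that reference this customer's ID.
query = client.query(kind='Order')
query.add_filter('customer_id', '=', 'C123456')
orders = list(query.fetch())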

Best practices for developers

Indexing strategy

By default, Cloud Datastore automatically indexes every property of each entity to support fast querying. Complex queries that combine multiple filters or sort orders, however, require composite indexes defined manually in an index.yaml file. Developers should review their queries and manage indexes carefully, because unnecessary indexes increase write latency and storage costs.

Query limitations

Datastore does not support joins, subqueries, or aggregation operations like those found in relational databases such as Microsoft SQL Server and MySQL. Because of this, application design often requires denormalization: storing related data together within a single entity, or using entity groups to maintain hierarchical relationships. Query filters must match existing indexes, and certain combinations of inequality and sort operations may require custom indexes.
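A minimal sketch of denormalization, using plain Python dictionaries in place of stored entities. The kinds, field names, and the helper denormalize are hypothetical and chosen only for this example.

```python
# Normalized layout: showing an order with the customer's name needs
# two lookups (the order, then the customer it references).
customers = {"C123456": {"name": "Ada Lovelace", "city": "London"}}
orders_normalized = [
    {"order_id": 1, "customer_id": "C123456", "total": 40.0},
    {"order_id": 2, "customer_id": "C123456", "total": 15.5},
]


def denormalize(order, customers):
    """Copy the customer fields an order page needs onto the order itself."""
    customer = customers[order["customer_id"]]
    return {**order, "customer_name": customer["name"]}


orders_denormalized = [denormalize(o, customers) for o in orders_normalized]

# A single read of an order now carries the customer's name with it.
print(orders_denormalized[0])
```

The cost of this layout is duplication: if the customer's name changes, every order that carries a copy of it has to be rewritten.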

Transactions and consistency

Cloud Datastore supports ACID transactions for operations on entities within a single entity group, which enables safe updates to related data such as parent and child entities. Single-entity lookups and ancestor queries are strongly consistent, whereas general queries across multiple entity groups are eventually consistent.

Language and API support

Google Cloud Datastore offers a RESTful interface and a gRPC API, which is useful for developers who design distributed applications. These APIs give direct access to Datastore features such as transactions, queries, and entity management. The client libraries are built on top of these APIs and hide the low-level details: they handle connection management, retries, and serialization for languages such as Python, Java, Go, Node.js, and C#. The API is also optimized for high-throughput, low-latency access and supports batch operations, ancestor queries, and strongly consistent reads within entity groups. With these features, developers can build scalable applications without managing complex database infrastructure themselves.

Query execution

When a user sends a request to Google Cloud Datastore, whether a write (insert, update, delete) or a read (get or query), the system follows a predefined sequence of steps to carry it out while maintaining performance, partitioning, and consistency.

Write commands (put, update, delete)

Unlike traditional relational systems, which parse SQL queries and optimize execution plans, Datastore uses a NoSQL document-style architecture. Each entity is uniquely identified by a key composed of a kind, an identifier (or name), and optionally a parent key.

The first step in a write operation is to authenticate the caller and check their IAM (Identity and Access Management) permissions. Once authorization succeeds, the request is routed, using a key-based partitioning strategy, to the correct entity group and then to a specific Datastore node. The entity's key determines which partition handles the request.
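The idea of key-based routing can be illustrated with a toy range-partitioning function. This is not Datastore's actual routing logic, which is internal to the service. The split points, the "Kind/id" string encoding, and the function name route are all assumptions made for this sketch.

```python
from bisect import bisect_right

# Hypothetical split points between three partitions, compared as strings:
# partition 0 holds roots below "Customer/M", partition 1 holds roots from
# there up to "Order/5000", and partition 2 holds everything after that.
SPLIT_POINTS = ["Customer/M", "Order/5000"]


def route(key_path):
    """Return the partition index for a key path such as
    ("Customer", "C123456", "Order", "1001").

    Only the root (kind, id) pair is used, so every entity in the same
    entity group maps to the same partition in this toy model.
    """
    root = f"{key_path[0]}/{key_path[1]}"
    return bisect_right(SPLIT_POINTS, root)


print(route(("Customer", "C123456")))                   # 0
print(route(("Customer", "C123456", "Order", "1001")))  # 0 (same entity group)
print(route(("Order", "7000")))                         # 2
```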

If the write targets a single entity group, Datastore offers strong consistency: writes are serialized and applied in a consistent order. When a write spans multiple entity groups, however, it relies on eventual consistency, and transactions or batched writes must be used to handle the operation.

Datastore relies on Google Cloud's Spanner infrastructure for data durability and high availability. Each write is replicated across multiple zones of the selected region. This replication is abstracted away from developers, who do not need to deal with the technical details of how the database engine executes writes. Each write operation also automatically updates all relevant single-property and composite indexes.

Read commands (get, query)

Read commands such as get() and structured queries (via query()) depend on how data is stored and indexed. For an entity lookup, a get() call uses the entity key to go directly to the appropriate partition and retrieve the entity. A lookup by key of this kind is strongly consistent.

During query execution, a structured query that involves filtering or sorting is translated into an index lookup rather than a full table scan. The query engine uses the relevant indexes to identify the keys of entities that match the query criteria and then retrieves the full entities.
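The two-phase pattern described above (find matching keys in an index, then fetch the entities) can be sketched with a sorted list standing in for a single-property index. This is a simplified in-memory model, not Datastore's storage format. The sample data and the function name query_by_status are assumptions made for this example.

```python
from bisect import bisect_left, bisect_right

# Entity storage, addressed by key.
entities = {
    "t1": {"status": "completed", "priority": "high"},
    "t2": {"status": "open", "priority": "low"},
    "t3": {"status": "completed", "priority": "low"},
}

# Single-property index on "status": sorted (value, entity_key) pairs.
status_index = sorted((props["status"], key) for key, props in entities.items())


def query_by_status(value):
    """Binary-search the index for matching keys, then fetch each entity."""
    lo = bisect_left(status_index, (value,))
    hi = bisect_right(status_index, (value, "\U0010ffff"))
    return [(key, entities[key]) for _, key in status_index[lo:hi]]


matches = query_by_status("completed")
print([key for key, _ in matches])  # ['t1', 't3']
```

Because the index is sorted, the matching rows form one contiguous range, so the lookup never has to examine entities that do not match.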

Datastore has two consistency modes for queries. The first, strong consistency, applies to ancestor queries and direct key lookups within a single entity group. The second, eventual consistency, applies to most queries across unrelated entities or entity groups, where the system trades consistency for higher availability.

The query planner decides how to execute a given query. Because Datastore does not support joins as traditional relational databases do, developers have to run multiple separate queries and combine the results themselves, a technique known as an "application-side join". The query planner also checks whether a given query can be served by existing indexes. If no appropriate index is found, the query fails unless the developer has predefined the index in a separate index.yaml file.
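The index check can be illustrated with a deliberately simplified matching rule: a composite index serves a query when it lists the equality-filtered properties first (in any order) and then the sort orders in sequence. The real planner's rules are richer than this. The names can_serve, plan, and MissingIndexError are hypothetical and exist only in this sketch.

```python
class MissingIndexError(Exception):
    """Raised by this sketch when no defined index matches the query."""


def can_serve(index, equality_props, sort_orders):
    """index and sort_orders are lists of (property, direction) pairs."""
    n = len(equality_props)
    return (
        len(index) == n + len(sort_orders)
        and {prop for prop, _ in index[:n]} == set(equality_props)
        and index[n:] == list(sort_orders)
    )


def plan(defined_indexes, equality_props, sort_orders):
    """Return the first index that can serve the query, or fail."""
    for index in defined_indexes:
        if can_serve(index, equality_props, sort_orders):
            return index
    raise MissingIndexError(
        f"no index for filters {equality_props} and sort orders {sort_orders}"
    )


# One composite index, matching the earlier Task query example.
DEFINED_INDEXES = [
    [("status", "asc"), ("priority", "asc"), ("created", "desc")],
]

chosen = plan(DEFINED_INDEXES, ["status", "priority"], [("created", "desc")])
print(chosen)

try:
    plan(DEFINED_INDEXES, ["status"], [("created", "desc")])
except MissingIndexError as exc:
    print("query rejected:", exc)
```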

Data modeling trade-offs

Google Cloud Datastore is designed to scale easily and simplify operations, but this comes with trade-offs, especially for features such as joins and transactions involving multiple rows. To achieve high scalability, Datastore requires developers to denormalize their data and design their schema around read patterns rather than normalization. This approach reduces query complexity and improves performance, especially in read-heavy applications. Developers must therefore plan their data models on the assumption that Cloud Datastore will not provide join operations. For more complex operations such as aggregations or full-text search, developers often integrate Datastore with other services, such as BigQuery for analytics or Elasticsearch for search, which means designing a system that connects multiple tools.

Comparison with other databases

Cloud Datastore vs. MongoDB

Cloud Datastore and MongoDB are both NoSQL document databases, but they differ in structure and capabilities. Cloud Datastore enforces a more structured document model with entity groups and prioritizes scalability and simplicity over rich querying features. MongoDB offers a flexible document model (BSON), a powerful aggregation pipeline, and limited join capabilities. Cloud Datastore is tightly integrated with Google Cloud Platform (GCP), whereas MongoDB can be self-hosted or used as a managed service through MongoDB Atlas on different cloud platforms.

Cloud Datastore vs. Relational Databases

Relational databases like MySQL and PostgreSQL follow a normalized schema with support for complex joins and full ACID transactions, which makes them well suited to applications that require strong consistency and relational logic. In contrast, Cloud Datastore follows a denormalized, schema-less NoSQL approach in which data is structured around access patterns to optimize read performance at scale. Traditional relational databases are better for transactional systems and complex queries, while Cloud Datastore is more effective for large-scale applications that need high availability and horizontal scalability.

Cloud Datastore vs. Amazon DynamoDB

Cloud Datastore and Amazon DynamoDB are both fully managed NoSQL databases from their respective cloud providers, GCP and AWS. DynamoDB provides tunable consistency, automatic scaling, and support for multi-item ACID transactions. Cloud Datastore supports strong consistency only for certain types of queries (ancestor queries) and has more limitations on transactional support. Both discourage the use of joins. DynamoDB requires explicit configuration for secondary indexes, whereas Cloud Datastore handles indexing automatically.

Cloud Datastore vs. Redis

Redis is an in-memory NoSQL key-value store designed for extremely fast access. It is well suited to caching, pub/sub systems, and real-time analytics. It prioritizes speed over durability, although persistence options are available. Cloud Datastore is a persistent, disk-based document database with structured queries, optimized for long-term data storage and retrieval. Redis is a good fit for fast, transient operations, while Cloud Datastore is better suited to structured, long-lived application data that needs managed scaling and indexing. Redis is often used in combination with a database like Cloud Datastore rather than as a replacement.
