Fully managed in the cloudStarburst GalaxySelf-managed anywhereStarburst Enterprise
- Start Free
Fully managed in the cloud
Published: April 19, 2023
Fun fact: we all likely interact with databases on a daily basis. In fact, databases are a huge part of our increasingly data-driven world. They provide efficient data storage, search, and retrieval of large amounts of important data in organizations of all sizes.
Moreover, databases are not a new technology, and the needs that they fulfill are as old as computing itself. The first computerized database was created in the early 1960s for data management. Since then, technology has rapidly progressed. These days, with new kinds of technology, and new ways of recording information, data comes in all shapes and sizes. To address this, various formats for storing and accessing that data are necessary.
A database is a large collection of data organized, for rapid search and retrieval by a computer.
Let’s break down this definition further. A key concept when discussing databases is that unlike data warehouses, data within databases is an organized collection of data for rapid search and retrieval. Why? If an organization cannot quickly and efficiently access its data, the data becomes useless and limits decision-making. Computers are used for database management(DBMS) because they are designed to process data far more efficiently than humans.
Here are some ways you may be using database applications. As you read through the examples and look through the user interface, consider the definition above and how it relates.
In the past, people had to watch tv or read the newspaper to know the weather forecast. These days, it’s easy to just open your phone or computer and pull up a detailed forecast for the next ten days. Weather companies use complex prediction models that depend on various factors, and the data that feeds these models must be stored in databases. The demand for real-time weather predictions means that this data must be accessed efficiently.
Online video streaming services use databases to store many different types of data. Data is generated in many ways. For instance, your viewing history and preferences might be combined with data created when new content is released on the platform. Together, this creates a powerful digital trail and is used by the companies that own the platform to make recommendations. Considering all these factors, it is easy to imagine the massive scale of the data being managed by streaming services.
Beautiful user interfaces that accompany online shopping have become increasingly popular in recent years. Online retailers use databases to store product information including the name, price, quantity, availability, and description of each product. This data must be accurate and easy to retrieve quickly in order for an eCommerce site to function smoothly.
Although the purpose of a database is straightforward, the technology used to achieve these objectives is both varied and complex. All databases require certain basic architectural components to function. They are compute and storage (disk).
Compute controls computing and anything associated with it. Compute primarily includes CPU and RAM. CPU is the Central Processing Unit, or the chip that handles the computations, and RAM is the short term, temporary storage that the CPU uses to keep track of what it’s working on. Compute is where all of the database operations occur.
In other words, data in storage will remain there even after a machine has been turned off. If compute is the brain of the database, storage is analogous to books, notes, or other areas where information is stored outside of the brain.
The following video excerpt describes the basic architectural components of a database in more detail. Additionally, this video discusses two ways to increase a database’s server capacity: scaling up or scaling out.
Data within a database is often organized for rapid search and retrieval. But what does this really mean?
Depending on the purpose of a database, there are certain ways of structuring it that will allow a computer to efficiently find and gather relevant data. This can be illustrated with a simple example.
Imagine that you work as a receptionist at a doctor’s office that has only paper copies of patient medical records. Each time a patient comes in for an appointment, you must pull their chart to give to the nurse. Naturally, if the charts were alphabetized by patient name, this would make your job far easier and faster.
Similarly, the way you structure a database can make it easier for the computer to “do its job”. One such way of structuring a database, relationally, will be critically important to understand. Relational databases enables the development of complex relationships between different data sets, which is not possible with hierarchical databases.
A relational database is a system that organizes data in predefined relationships. It represents data as one or more tables composed of rows and columns. In database terminology, rows are often referred to as records, while columns are often referred to as fields. Each row represents an object and the information about that object. Each column represents one type of information about all of the objects in the table.
As such, data integrity is enforced in relational databases. Examples of a RDBMS include: MySQL, Oracle database, Microsoft SQL Server, and PostgreSQL.
Take a look at the image below for a visual representation of the terminology just introduced. The table contains data about students at a university. Each row represents one student, and each column represents one type of information about that student, such as first name, last name, and email address.
Where does the “relational” part come in?
Relational databases allow multiple tables to be linked in ways that mirror relationships held in the real world. A database schema defines how data is organized and connected within a relational database.
Consider the students table again. In reality, there are far more categories of student data than what is listed in the table. Some examples are their class schedule, grades, and professors. While you could simply put all of that information in the students table, you would end up with redundant information, which is something that is discouraged in relational databases.
Different categories of data are each given their own table, and then those tables are related to each other through a predefined structure, or schema. The relationships make it easier to query the data and join information. For example, if you wanted to list a particular student’s course schedule, you could easily find that information, even though it would have to be gathered from two different tables.
Let’s continue with the example students table and imagine that you are trying to build a relational database to store student information for a university.
The following video describes the process of designing tables to store this information and keys to distinguish between records in the tables. This video also defines the relationships among the tables in a database.
The video above provides a very simple example of relational database structure in practice. It provides a way of introducing you to the main principles at play when considering how this database type is used. Please use the callout below as a review of the main ideas you learned in the video.
Data generated from different sources is typically structured differently. You can imagine that data gathered from financial transactions will differ significantly from social media data. In the same way, data from scientific systems will be structured differently from marketing systems. In fact, the structure of a dataset varies almost as much as the data inside it. This means that the methods used to store and access data must also vary according to the requirements of the database in question. While relational databases have been around for quite some time, they are not the only option available, nor are they always the best option.
In this post, we outline 3 types of non-relational databases(also known as NoSQL databases):
Key-value pairs are a powerful data model used in a number of different scenarios. They store data as a set of pairings, each involving both a key and a value. Here, the two elements work in tandem to store information. The key serves as a unique identifier or path to the value. Keys can be anything, depending upon the restrictions imposed by the database software. The values in a key-value pair can be anything from a simple string to a complex object, such as a video or document. In a key-value database, key-value pairs are collected to produce dictionaries or unordered collections. They use compact, efficient index structures to quickly and reliably locate a value by its key. This is similar to the way you locate a word that you would like to define in a physical dictionary. Further hierarchy can be introduced by setting one set of key-value pairs inside another, creating a parent-child relationship. This is called nesting and allows further flexibility, and greater power.
The image below shows some examples of key-value pairs.
Simple key value example
This is an example of simple key-value pairs. The keys are both numeric and the values are strings. The keys will uniquely identify their values. For example, key 123 will point to 123 Main St.
More complex key value example
This is also a representation of a key-value pair, but it is more complex than the one above. This time, the key is a customer id, while the value is an object comprised of various, sometimes nested, information, such as customer name and billing address.
Key-value databases are best suited for applications that require many small, continuous reads and writes and for applications that don’t require frequent updates or need to support complex queries.
One such example is storing web session information. Each web session is unique, and therefore assigned a session id, which can serve as the key for the web session information.
The value can then be composed of all of the web session information in one object. This data is stored during the web session, and does not need to be updated afterward.
Another example use-case is for storing sensor data. This is common in industries like oil and gas where there are multiple sensors sending data for various purposes, resulting in continuous reads and writes. The constant flow of data from these sensors makes the simple and efficient storage offered by a key-value database ideal.
Examples of key-value databases include Redis, Riak, Hazelcast, Amazon DynamoDB.
The simple format of the data allows for accelerated read and write operations.
Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled when reading individual key values is desired.
Key-value pairs lack object optimizations inherent to non-key-value pair systems. For example, pre-defined structures, indexes, and statistics cannot easily be handled with this data type.
Many key-value databases do not support ACID compliance, which is not ideal for transactions.
Although lack of schema can increase development speed, it can also potentially cause confusion and delay in the long run.
One way to organize databases breaks free from the traditional structure of rows and columns, just like object-oriented databases. Instead, data is stored as a series of documents. In this context a document is an object that encapsulates and encodes data in some standard format.
Typical formats include XML, YAML, and JSON. Each document is a self-contained data entry that includes all of the information needed to understand it.
Let’s explore this concept using an example. Suppose you had a vehicle database, and wanted to store information about the car pictured below.
If you were using a relational database, you would first have to decompose the car into its components in order to create separate tables for the body, the type of paint, the engine, and so on.
Before reading the data, you would then have to recompose everything, which is highly inefficient. This is where a document-oriented approach comes in handy. In the document-oriented model, information is stored in files, and there is no need to decompose or recompose anything, saving time and disk space.
Below is an example of what the record for the car might look like in JSON format. Notice that all the information is contained within one document. The data in the JSON document takes the format of field:value. The field name allows you to see what kind of data is being stored just by skimming through the document. Notice that nested values, objects contained within another object, are also allowed. For example, the “options” field contains five nested field:value pairs.
“audio”: “Bose 5.1”,
“wheels”: “Alloy 20”,
“dealer”: “Jones Acura”,
“address”: “3234 Main”,
“city”: “San Jose”,
“name”: “Patrick Sheehan”,
Document-oriented databases use a schema-on-read construct, which allows for flexibility in formatting. Suppose you wanted to store data for two entirely different vehicle types–the data for a snowmobile, alongside the data for the car. This would create a potential structure conflict because some of the options that are available for the car, such as audio and floor mats, are not available for a snowmobile. Because you are using a document-oriented database, this is not a problem. You can simply put the data that you have into a document.The format is adaptable enough to allow for this difference.
One of the challenges in using a document-oriented database is its lack of structured query language (SQL) support. It is often called “NoSQL” for this reason. Many of the tools that support document-oriented databases have added ways to query data from them using SQL or a SQL-like language. (Sidenote: SQL is certainly an easier programming language to learn than say, python.) But that requires enforcing schema rules on documents to make them conform to rows and columns, which eliminates their biggest advantages.
Examples of document databases include MongoDB, CouchDB, GemStone, CosmosDB, and DocumentDB.
Document-oriented databases dynamically mirror their application counterparts, which means they do not have to be deconstructed into relational objects. It also means design time is significantly decreased relative to relational/columnar databases.
Operations may be executed with little to no data manipulation, equating to lower I/O and vastly superior performance relative to their relational and columnar cousins.
They can scale-out as well as up.
There are many vendor options to choose from, and it can be difficult to find documentation for usage outside of the vendor’s own documentation.
This is only a disadvantage if you require structured, networked data.
The lack of structure in these databases makes them more difficult to query.
Graph databases store data in a completely different way. They use vector geometry to represent data as a series of nodes, edges, and properties. Examples of graph databases include Neo4J, OrientDB, ArangoDB, TitanDB. Let’s explore these concepts more directly.
Here, a node represents an entity or instance. Depending on the use case of the database in question, this could include data pertaining to people, businesses, accounts, or other items being tracked. In this sense, nodes are roughly the equivalent of a record, relation, or row in a relational database, or a document in a document-oriented database.
Information regarding the relationship between nodes is stored as edges (also termed graphs or relationships). You can think of edges as a line between two points on a map. In this analogy, each of these points will be represented by a different node.
In graph databases, properties record attributes associated with nodes. For example, imagine that you have a graph database that holds data pertaining to orders. Each order will be represented as a separate node. Each node will be linked to the properties of that order, including customer name, location, supplier name, and product name. In this example, the edges would describe the relationship between both the nodes themselves and properties attached to them. Notably, edges also describe the relationship between one property and another. In this way, meaningful patterns emerge when examining the connections and interconnections of nodes, properties, and edges.
The diagram below illustrates the key benefit of graph databases in action. Here, you can see the ability to handle complex, interconnected, or network data in a way that builds insights. Many common technologies make use of this approach. For example, LinkedIn uses graph databases to easily store information about an incredibly complex network of users. In this example, the nodes are the LinkedIn users, the properties are the information in a user’s profile, and the edges are connections between the users.
Key-value databases store the information without regard to the relationship among the key-value pairs, so in this example, there would be no way of knowing how the product, customer, supplier, and location relate without writing some code to define those relationships.
In relational databases, some of the relationships are predefined, but you are still required to write code (SQL) to get that information back.
Graph databases are built in a way that complex relationships are defined within the structure, making it easy to discover the relationship between networks and see emerging patterns.
Graph databases represent the relationships between elements as an inherent feature of their design. They do not require additional processing activity, such as the join operations required in relational databases.
Graph databases allow for fast identification of relationships.
Like most of their non-relational kin, graph databases are not a good choice for transactional systems.
The user base for graph databases is relatively small, which may cause problems with open-source support.
Data comes in various data structures (i.e. structured data vs. unstructured data) and so do databases. They come in all shapes, and sizes (i.e. relational vs. non-relational; traditional on-premise databases vs. cloud databases), each with its own set of advantages and disadvantages.
When choosing a database type, organizations must weigh the advantages and disadvantages according to their goals as well as end users. Consult with your organization’s database administrators as they use specialized DBMS software and data security to safely and securely store and organize data. In many cases, this results in organizations utilizing multiple types of databases or distributed databases for different purposes.
Learn about database management systems, relational databases, non-relational databases and so much more on Starburst Academy!
Up to $500 in usage credits included