Databases and File Systems

 

  1. The relational model

The relational data model is the primary data model, used widely around the world for data storage and processing. The model is simple, and it has all the properties and capabilities required to process data with storage efficiency.

Concepts

Tables − In the relational data model, relations are saved in the form of tables. This format stores the relationships among entities. A table has rows and columns, where rows represent records and columns represent attributes.

Tuple − A single row of a table, containing a single record of the relation, is called a tuple.

Relation instance − A finite set of tuples in the relational database system represents a relation instance. Relation instances do not contain duplicate tuples.

Relation schema − A relation schema describes the relation name (table name), its attributes, and their names.

Relation key − Each row has one or more attributes, known as the relation key, which can uniquely identify the row in the relation (table).

Attribute domain − Every attribute has a pre-defined value scope, known as the attribute domain.

Constraints

Every relation has some conditions that must hold for it to be a valid relation. These conditions are called Relational Integrity Constraints. There are three main integrity constraints −

  • Key constraints

  • Domain constraints

  • Referential integrity constraints

Key Constraints

There must be at least one minimal subset of attributes in the relation which can uniquely identify a tuple. This minimal subset of attributes is called the key for that relation. If there is more than one such minimal subset, they are called candidate keys.

Key constraints enforce that −

  • in a relation with a key attribute, no two tuples can have identical values for the key attributes.

  • a key attribute cannot have NULL values.

Key constraints are also referred to as Entity Constraints.
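As a sketch, the two rules above can be seen in action in SQLite (the student table and its column names are illustrative, not from the text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no TEXT PRIMARY KEY NOT NULL, name TEXT)")
conn.execute("INSERT INTO student VALUES ('S1', 'Alice')")

# No two tuples may have identical values for the key attribute:
try:
    conn.execute("INSERT INTO student VALUES ('S1', 'Bob')")
except sqlite3.IntegrityError as e:
    print("duplicate key rejected:", e)

# A key attribute cannot hold NULL:
try:
    conn.execute("INSERT INTO student VALUES (NULL, 'Carol')")
except sqlite3.IntegrityError as e:
    print("NULL key rejected:", e)
```

Both offending inserts are rejected, and only the original tuple remains in the relation.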

Domain Constraints

Attributes have specific values in real-world scenarios. For example, age can only be a positive integer. The same kind of constraints apply to the attributes of a relation: every attribute is bound to a specific range of values. For example, age cannot be less than zero and telephone numbers cannot contain a digit outside 0-9.
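A minimal sketch of a domain constraint, using a CHECK clause in SQLite (the person table is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK clause restricts age to its domain: non-negative integers.
conn.execute("CREATE TABLE person (name TEXT, age INTEGER CHECK (age >= 0))")
conn.execute("INSERT INTO person VALUES ('Alice', 30)")    # within the domain: accepted
try:
    conn.execute("INSERT INTO person VALUES ('Bob', -5)")  # outside the domain: rejected
except sqlite3.IntegrityError as e:
    print("domain violation:", e)
```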

Referential integrity Constraints

Referential integrity constraints work on the concept of foreign keys. A foreign key is a key attribute of a relation that can be referred to in another relation.

Referential integrity constraints state that if a relation refers to a key attribute of a different or the same relation, then that key element must exist.
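A sketch of referential integrity in SQLite (the dept/emp tables are illustrative; note that SQLite enforces foreign keys only when the pragma is switched on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this pragma
conn.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE emp (
    emp_id  INTEGER PRIMARY KEY,
    dept_id INTEGER REFERENCES dept(dept_id))""")

conn.execute("INSERT INTO dept VALUES (1, 'Finance')")
conn.execute("INSERT INTO emp VALUES (10, 1)")       # referenced key exists: accepted
try:
    conn.execute("INSERT INTO emp VALUES (11, 99)")  # dept 99 does not exist: rejected
except sqlite3.IntegrityError as e:
    print("referential integrity violation:", e)
```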


  2. ER model

The ER model defines the conceptual view of a database. It works around real-world entities and the associations among them. At view level, the ER model is considered a good option for designing databases.

Entity

An entity can be a real-world object, either animate or inanimate, that can be easily identifiable. For example, in a school database, students, teachers, classes, and courses offered can be considered as entities. All these entities have some attributes or properties that give them their identity.

An entity set is a collection of similar types of entities. An entity set may contain entities whose attributes share similar values. For example, a Students set may contain all the students of a school; likewise, a Teachers set may contain all the teachers of a school from all faculties. Entity sets need not be disjoint.

Attributes

Entities are represented by means of their properties, called attributes. All attributes have values. For example, a student entity may have name, class, and age as attributes.

There exists a domain or range of values that can be assigned to attributes. For example, a student's name cannot be a numeric value. It has to be alphabetic. A student's age cannot be negative, etc.

Types of Attributes

  • Simple attribute − Simple attributes are atomic values, which cannot be divided further. For example, a student's phone number is an atomic value of 10 digits.

  • Composite attribute − Composite attributes are made of more than one simple attribute. For example, a student's complete name may have first_name and last_name.

  • Derived attribute − Derived attributes do not exist in the physical database; their values are derived from other attributes present in the database. For example, average_salary in a department should not be saved directly in the database; instead, it can be derived. As another example, age can be derived from date_of_birth.

  • Single-value attribute − Single-value attributes contain a single value. For example − Social_Security_Number.

  • Multi-value attribute − Multi-value attributes may contain more than one value. For example, a person can have more than one phone number, email_address, etc.

These attribute types can be combined, as in −

  • simple single-valued attributes

  • simple multi-valued attributes

  • composite single-valued attributes

  • composite multi-valued attributes

Entity-Set and Keys

A key is an attribute or a collection of attributes that uniquely identifies an entity within an entity set.

For example, the roll_number of a student makes him/her identifiable among students.

  • Super Key − A set of attributes (one or more) that collectively identifies an entity in an entity set.

  • Candidate Key − A minimal super key is called a candidate key. An entity set may have more than one candidate key.

  • Primary Key − A primary key is one of the candidate keys chosen by the database designer to uniquely identify the entity set.

Relationship

The association among entities is called a relationship. For example, an employee works_at a department, a student enrolls in a course. Here, Works_at and Enrolls are called relationships.

Relationship Set

A set of relationships of similar type is called a relationship set. Like entities, a relationship too can have attributes. These attributes are called descriptive attributes.

Degree of Relationship

The number of participating entities in a relationship defines the degree of the relationship.

  • Binary = degree 2

  • Ternary = degree 3

  • n-ary = degree n

Mapping Cardinalities

Cardinality defines the number of entities in one entity set that can be associated, via a relationship set, with entities of another set.

  • One-to-one − One entity from entity set A can be associated with at most one entity of entity set B and vice versa.

  • One-to-many − One entity from entity set A can be associated with more than one entity of entity set B; however, an entity from entity set B can be associated with at most one entity from entity set A.

  • Many-to-one − More than one entity from entity set A can be associated with at most one entity of entity set B; however, an entity from entity set B can be associated with more than one entity from entity set A.

  • Many-to-many − One entity from A can be associated with more than one entity from B and vice versa.


  3. SQL

What is SQL?

SQL stands for Structured Query Language, a computer language for storing, manipulating, and retrieving data stored in a relational database.

SQL is the standard language for relational database systems. All relational database management systems (RDBMS), such as MySQL, MS Access, Oracle, Sybase, Informix, PostgreSQL and SQL Server, use SQL as their standard database language.

However, they also use different dialects, such as −

  • MS SQL Server using T-SQL,

  • Oracle using PL/SQL,

  • MS Access's version of SQL is called JET SQL (its native format), etc.

Why SQL?

SQL is widely popular because it offers the following advantages −

  • Allows users to access data in the relational database management systems.

  • Allows users to describe the data.

  • Allows users to define the data in a database and manipulate that data.

  • Allows SQL code to be embedded within other languages using SQL modules, libraries & pre-compilers.

  • Allows users to create and drop databases and tables.

  • Allows users to create views, stored procedures, and functions in a database.

  • Allows users to set permissions on tables, procedures and views.

A Brief History of SQL

  • 1970 − Dr. Edgar F. "Ted" Codd of IBM is known as the father of relational databases. He described a relational model for databases.

  • 1974 − Structured Query Language appeared.

  • 1978 − IBM worked to develop Codd's ideas and released a product named System R.

  • 1986 − SQL was standardized by ANSI. The first commercial relational database had been released by Relational Software, which later came to be known as Oracle.

SQL Process

When you execute an SQL command in any RDBMS, the system determines the best way to carry out your request, and the SQL engine figures out how to interpret the task.

There are various components included in this process.

These components are −

  • Query Dispatcher

  • Optimization Engines

  • Classic Query Engine

  • SQL Query Engine, etc.

A classic query engine handles all the non-SQL queries, but an SQL query engine does not handle logical files.

SQL Commands

The standard SQL commands to interact with relational databases are CREATE, SELECT, INSERT, UPDATE, DELETE and DROP. These commands can be classified into the following groups based on their nature −

DDL - Data Definition Language

1. CREATE − Creates a new table, a view of a table, or other object in the database.

2. ALTER − Modifies an existing database object, such as a table.

3. DROP − Deletes an entire table, a view of a table, or other objects in the database.

DML - Data Manipulation Language

1. SELECT − Retrieves certain records from one or more tables.

2. INSERT − Creates a record.

3. UPDATE − Modifies records.

4. DELETE − Deletes records.
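A short sketch showing the DDL and DML commands above in action, using SQLite (the customers table and its contents are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# DML: insert, update, delete, and query records.
conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'Oslo')")
conn.execute("INSERT INTO customers VALUES (2, 'Bob', 'Bergen')")
conn.execute("UPDATE customers SET city = 'Tromso' WHERE id = 2")
conn.execute("DELETE FROM customers WHERE id = 1")
rows = conn.execute("SELECT id, name, city FROM customers").fetchall()
print(rows)  # [(2, 'Bob', 'Tromso')]

# DDL: drop the table when it is no longer needed.
conn.execute("DROP TABLE customers")
```

(GRANT and REVOKE are not shown here, since SQLite has no user-permission system.)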

DCL - Data Control Language

1. GRANT − Gives a privilege to a user.

2. REVOKE − Takes back privileges granted to a user.


  4. Functional dependency and relational database design

What is Functional Dependency

Functional dependency in a DBMS is, as the name suggests, a relationship in which one attribute of a table depends on another. Introduced by E. F. Codd, it helps prevent data redundancy and identify bad designs.

To understand the concept thoroughly, consider a relation P with attributes A and B. A functional dependency is represented by an arrow sign (->).

The functional dependency of B on A is then written as:

A -> B


This reads "A functionally determines B": each value of A is associated with exactly one value of B.

Example

The following is an example that would make it easier to understand functional dependency:

We have a <Department> table with two attributes: DeptId and DeptName.

DeptId = Department ID
DeptName = Department Name


The DeptId is our primary key. Here, DeptId uniquely identifies the DeptName attribute: if you want to know the department name, you first need to have the DeptId.

DeptId    DeptName
001       Finance
002       Marketing
003       HR


Therefore, the functional dependency above between DeptId and DeptName can be stated as DeptName being functionally dependent on DeptId:

DeptId -> DeptName
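Whether a dependency such as DeptId -> DeptName actually holds on a given relation instance can be checked mechanically: every value of the determinant must map to exactly one value of the dependent attribute. A small sketch in Python (the helper name holds is illustrative):

```python
def holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds on a relation instance.

    rows: list of dicts, one per tuple of the relation.
    lhs, rhs: attribute names.
    """
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same determinant value, different dependent value
        seen[key] = row[rhs]
    return True

department = [
    {"DeptId": "001", "DeptName": "Finance"},
    {"DeptId": "002", "DeptName": "Marketing"},
    {"DeptId": "003", "DeptName": "HR"},
]
print(holds(department, "DeptId", "DeptName"))  # True: DeptId -> DeptName holds
```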


Types of Functional Dependency

Functional Dependency has three forms:

  1. Trivial Functional Dependency

  2. Non-Trivial Functional Dependency

  3. Completely Non-Trivial Functional Dependency

Let us begin with Trivial Functional Dependency:

Trivial Functional Dependency

It occurs when B is a subset of A in:

A ->B


Example

We are considering the same <Department> table with two attributes to understand the concept of trivial dependency.

The following is a trivial functional dependency, since DeptId is a subset of { DeptId, DeptName }:

{ DeptId, DeptName } -> DeptId


Non-Trivial Functional Dependency

It occurs when B is not a subset of A in:

A ->B


Example

DeptId ->  DeptName


The above is a non-trivial functional dependency, since DeptName is not a subset of DeptId.

Completely Non-Trivial Functional Dependency

It occurs when the intersection of A and B is empty in:

A ->B


Armstrong’s Axioms Property of Functional Dependency

Armstrong’s axioms were developed by William Armstrong in 1974 to reason about functional dependencies.

They consist of the following inference rules:

  1. Reflexivity
    If B is a subset of A, then A -> B.

  2. Augmentation
    If A -> B, then AC -> BC.

  3. Transitivity
    If A -> B and B -> C, then A -> C, i.e. a transitive relation.
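The axioms can be applied mechanically by computing the closure of an attribute set, i.e. the set of all attributes it functionally determines. A sketch (the function name closure is illustrative):

```python
def closure(attrs, fds):
    """Compute the closure of a set of attributes under a set of FDs.

    fds: list of (lhs, rhs) pairs, each side a frozenset of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, the right side is too
            # (this is transitivity and augmentation applied repeatedly).
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# A -> B and B -> C imply A -> C (transitivity):
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
print(sorted(closure({"A"}, fds)))  # ['A', 'B', 'C']
```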

Relational database design

These databases are characterized by a set of tables where data fits into a pre-defined category. A table consists of rows and columns, where each column holds data for a specific category and each row contains an instance of data for those categories. The Structured Query Language (SQL) is the standard user and application-program interface for a relational database.

There are various simple operations that can be applied to the tables, which makes these databases easier to extend, to join two databases with a common relation, and to modify for all existing applications.

Database Design Objective

A well-designed database shall:

  • Eliminate Data Redundancy: the same piece of data shall not be stored in more than one place. Duplicate data not only wastes storage space but also easily leads to inconsistencies.

  • Ensure Data Integrity and Accuracy:

  • [TODO] more

Relational Database Design Process

Database design is more art than science, as you have to make many decisions. Databases are usually customized to suit a particular application. No two customized applications are alike, and hence, no two databases are alike. Guidelines (usually in terms of what not to do, rather than what to do) are provided for making these design decisions, but the choices ultimately rest with you, the designer.

Step 1: Define the Purpose of the Database (Requirement Analysis)

Gather the requirements and define the objective of your database, e.g. ...

Drafting out the sample input forms, queries and reports, often helps.

Step 2: Gather Data, Organize in tables and Specify the Primary Keys

Once you have decided on the purpose of the database, gather the data that needs to be stored in it. Divide the data into subject-based tables.

Choose one column (or a few columns) in each table as the so-called primary key, which uniquely identifies each of the rows.

Step 3: Create Relationships among Tables

A database consisting of independent and unrelated tables serves little purpose (you might consider using a spreadsheet instead). The power of a relational database lies in the relationships that can be defined between tables. The most crucial aspect of designing a relational database is identifying the relationships among tables. The types of relationship include:

  1. one-to-many

  2. many-to-many

  3. one-to-one

Step 4: Refine and Normalize the Design

For example,

  • adding more columns,

  • creating a new table for optional data using a one-to-one relationship,

  • splitting a large table into two smaller tables,

  • others.






  5. File structure

Related data and information are stored collectively in file formats. A file is a sequence of records stored in binary format. A disk drive is formatted into several blocks that can store records. File records are mapped onto those disk blocks.

File Organization

File Organization defines how file records are mapped onto disk blocks. We have four types of File Organization to organize file records −

Heap File Organization

When a file is created using Heap File Organization, the Operating System allocates memory area to that file without any further accounting details. File records can be placed anywhere in that memory area. It is the responsibility of the software to manage the records. Heap File does not support any ordering, sequencing, or indexing on its own.

Sequential File Organization

Every file record contains a data field (attribute) to uniquely identify that record. In sequential file organization, records are placed in the file in some sequential order based on the unique key field or search key. Practically, it is not possible to store all the records sequentially in physical form.

Hash File Organization

Hash File Organization uses Hash function computation on some fields of the records. The output of the hash function determines the location of disk block where the records are to be placed.
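A toy sketch of the idea (the hash function here is deliberately simplistic; a real DBMS uses a carefully chosen hash function and handles overflow blocks):

```python
# Hash file organization: the hash of the key field selects the disk block.
NUM_BLOCKS = 8

def block_for(key):
    # Toy hash function: sum of character codes modulo the number of blocks.
    return sum(ord(c) for c in key) % NUM_BLOCKS

blocks = {b: [] for b in range(NUM_BLOCKS)}
for key, payload in [("r1", "Alice"), ("r2", "Bob"), ("r3", "Carol")]:
    blocks[block_for(key)].append((key, payload))

# Retrieval recomputes the hash instead of scanning every block:
print(blocks[block_for("r2")])  # [('r2', 'Bob')]
```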

Clustered File Organization

Clustered file organization is not considered good for large databases. In this mechanism, related records from one or more relations are kept in the same disk block, that is, the ordering of records is not based on primary key or search key.

File Operations

Operations on database files can be broadly classified into two categories −

  • Update Operations

  • Retrieval Operations

Other than creation and deletion of a file, there could be several operations, which can be done on files.

  • Open − A file can be opened in one of the two modes, read mode or write mode. In read mode, the operating system does not allow anyone to alter data. In other words, data is read only. Files opened in read mode can be shared among several entities. Write mode allows data modification. Files opened in write mode can be read but cannot be shared.

  • Locate − Every file has a file pointer, which tells the current position where the data is to be read or written. This pointer can be adjusted accordingly. Using find (seek) operation, it can be moved forward or backward.

  • Read − By default, when files are opened in read mode, the file pointer points to the beginning of the file. There are options where the user can tell the operating system where to locate the file pointer at the time of opening a file. The very next data to the file pointer is read.

  • Write − User can select to open a file in write mode, which enables them to edit its contents. It can be deletion, insertion, or modification. The file pointer can be located at the time of opening or can be dynamically changed if the operating system allows to do so.

  • Close − This is the most important operation from the operating system’s point of view. When a request to close a file is generated, the operating system removes all the locks (if the file was in shared mode), saves the data (if altered) to the secondary storage media, and releases all the buffers and file handlers associated with the file.
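The open/locate/read/write/close operations above map directly onto ordinary file APIs. A sketch using Python's built-in file objects (the file name and record layout are illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "records.dat")

# Write mode: create the file and append two fixed-length (7-byte) records.
f = open(path, "wb")
f.write(b"rec-001")
f.write(b"rec-002")
f.close()            # close flushes buffers to the storage medium

# Read mode: locate (seek) the second record, then read it.
f = open(path, "rb")
f.seek(7)            # move the file pointer past the first 7-byte record
print(f.read(7))     # b'rec-002'
f.close()
```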


  6. Transaction management and concurrency control

A transaction is a logical unit of processing in a DBMS that entails one or more database access operations. In a nutshell, database transactions represent real-world events of an enterprise.

All database access operations between the beginning and end transaction statements are considered a single logical transaction. During the transaction the database may be inconsistent; only once the transaction is committed does the database move from one consistent state to another.


Transaction failure in between the operations

Now that we understand what a transaction is, we should understand the problems associated with it.

The main problem that can occur during a transaction is that it can fail before finishing all the operations in the set, for example due to a power failure or system crash. This is serious because it can leave the database in an inconsistent state. Consider a funds transfer consisting of a debit from your account followed by a credit to your friend's account: if the transaction fails after the debit, the amount is deducted from your account but your friend never receives it.

To solve this problem, we have the following two operations

Commit: If all the operations in a transaction are completed successfully then commit those changes to the database permanently.

Rollback: If any of the operations fails, then roll back all the changes done by the previous operations.

Even though these operations help us avoid several issues that may arise during a transaction, they are not sufficient when two transactions are running concurrently. To handle those problems, we need to understand the database ACID properties.
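A sketch of commit and rollback for the transfer scenario, using SQLite (the account table and the simulated failure are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("you", 100), ("friend", 0)])
conn.commit()

# A transfer is one transaction: either both updates survive, or neither does.
try:
    conn.execute("UPDATE account SET balance = balance - 50 WHERE owner = 'you'")
    raise RuntimeError("simulated failure before the credit operation")
    # conn.execute("UPDATE account SET balance = balance + 50 WHERE owner = 'friend'")
    # conn.commit()
except RuntimeError:
    conn.rollback()  # undo the partial debit

print(conn.execute("SELECT owner, balance FROM account ORDER BY owner").fetchall())
# [('friend', 0), ('you', 100)]
```

After the rollback, both balances are exactly as they were before the failed transfer.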


What are ACID Properties?

For maintaining the integrity of data, the DBMS has to ensure the ACID properties. ACID stands for Atomicity, Consistency, Isolation, and Durability.

  • Atomicity: A transaction is a single unit of operation. You either execute it entirely or do not execute it at all. There cannot be partial execution. 

  • Consistency: Once the transaction is executed, it should move the database from one consistent state to another.

  • Isolation: A transaction should be executed in isolation from other transactions; the effect should be as if no other transaction runs concurrently with it. For every pair of transactions, one transaction appears to start execution only after the other has finished.

  • Durability: Once a transaction completes successfully, the changes it has made to the database should be permanent, even if there is a system failure. The recovery-management component of the database system ensures the durability of transactions.


Concurrency Control

Concurrency control is the procedure in a DBMS for managing simultaneous operations without them conflicting with each other.

In a multiprogramming environment where multiple transactions can be executed simultaneously, it is highly important to control the concurrency of transactions. We have concurrency control protocols to ensure atomicity, isolation, and serializability of concurrent transactions. Concurrency control protocols can be broadly divided into two categories −

  • Lock based protocols

  • Time stamp based protocols

Why use Concurrency method?

Reasons for using a concurrency control method in a DBMS:

  • To apply Isolation through mutual exclusion between conflicting transactions

  • To resolve read-write and write-write conflict issues 

  • To preserve database consistency

  • The system needs to control the interaction among the concurrent transactions. This control is achieved using concurrent-control schemes.

  • Concurrency control helps to ensure serializability 

Lock-based Protocols

Database systems equipped with lock-based protocols use a mechanism by which any transaction cannot read or write data until it acquires an appropriate lock on it. Locks are of two kinds −

  • Binary Locks − A lock on a data item can be in one of two states; it is either locked or unlocked.

  • Shared/exclusive − This type of locking mechanism differentiates the locks based on their uses. If a lock is acquired on a data item to perform a write operation, it is an exclusive lock. Allowing more than one transaction to write on the same data item would lead the database into an inconsistent state. Read locks are shared because no data value is being changed.

There are four types of lock protocols available −

Simplistic Lock Protocol

Simplistic lock-based protocols allow transactions to obtain a lock on every object before a 'write' operation is performed. Transactions may unlock the data item after completing the 'write' operation.

Pre-claiming Lock Protocol

Pre-claiming protocols evaluate their operations and create a list of data items on which they need locks. Before initiating an execution, the transaction requests the system for all the locks it needs beforehand. If all the locks are granted, the transaction executes and releases all the locks when all its operations are over. If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.


Two-Phase Locking 2PL

This locking protocol divides the execution phase of a transaction into three parts. In the first part, when the transaction starts executing, it seeks permission for the locks it requires. The second part is where the transaction acquires all the locks. As soon as the transaction releases its first lock, the third phase starts. In this phase, the transaction cannot demand any new locks; it only releases the acquired locks.


Two-phase locking has two phases, one is growing, where all the locks are being acquired by the transaction; and the second phase is shrinking, where the locks held by the transaction are being released.

To claim an exclusive (write) lock, a transaction must first acquire a shared (read) lock and then upgrade it to an exclusive lock.
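The two-phase property itself is easy to check on a transaction's sequence of lock/unlock operations: once a transaction has released any lock, it must not acquire another. A sketch (the operation encoding is illustrative):

```python
def is_two_phase(ops):
    """Check the two-phase property: no lock may be acquired after the first unlock."""
    unlocked = False
    for op in ops:
        if op == "unlock":
            unlocked = True          # shrinking phase has begun
        elif op == "lock" and unlocked:
            return False             # new lock after a release: violates 2PL
    return True

print(is_two_phase(["lock", "lock", "unlock", "unlock"]))  # True
print(is_two_phase(["lock", "unlock", "lock"]))            # False
```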

Strict Two-Phase Locking

The first phase of Strict-2PL is the same as in 2PL. After acquiring all the locks in the first phase, the transaction continues to execute normally. But in contrast to 2PL, Strict-2PL does not release a lock right after using it. Instead, Strict-2PL holds all the locks until the commit point and then releases them all at once.


Unlike 2PL, Strict-2PL does not suffer from cascading aborts.

Timestamp-based Protocols

The most commonly used concurrency protocol is the timestamp-based protocol. This protocol uses either the system time or a logical counter as the timestamp.

Lock-based protocols manage the order between the conflicting pairs among transactions at the time of execution, whereas timestamp-based protocols start working as soon as a transaction is created.

Every transaction has a timestamp associated with it, and the ordering is determined by the age of the transaction. A transaction created at clock time 0002 would be older than all transactions that come after it. For example, a transaction 'y' entering the system at 0004 is two seconds younger, and priority is given to the older one.

In addition, every data item is given the latest read and write-timestamp. This lets the system know when the last ‘read and write’ operation was performed on the data item.

Timestamp Ordering Protocol

The timestamp-ordering protocol ensures serializability among transactions in their conflicting read and write operations. It is the responsibility of the protocol system to execute the conflicting pairs of operations according to the timestamp values of the transactions.

  • The timestamp of transaction Ti is denoted as TS(Ti).

  • Read time-stamp of data-item X is denoted by R-timestamp(X).

  • Write time-stamp of data-item X is denoted by W-timestamp(X).

Timestamp ordering protocol works as follows −

  • If a transaction Ti issues a read(X) operation −

    • If TS(Ti) < W-timestamp(X), the operation is rejected.

    • If TS(Ti) >= W-timestamp(X), the operation is executed and R-timestamp(X) is updated.

  • If a transaction Ti issues a write(X) operation −

    • If TS(Ti) < R-timestamp(X), the operation is rejected.

    • If TS(Ti) < W-timestamp(X), the operation is rejected and Ti is rolled back.

    • Otherwise, the operation is executed and W-timestamp(X) is updated.
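The read and write rules above can be sketched directly in code (the data-item representation is illustrative; "R" and "W" stand for R-timestamp(X) and W-timestamp(X)):

```python
def read(ts_ti, item):
    """Timestamp-ordering rule for read(X)."""
    if ts_ti < item["W"]:
        return "rejected"              # Ti would read an already-overwritten value
    item["R"] = max(item["R"], ts_ti)  # record the latest read
    return "executed"

def write(ts_ti, item):
    """Timestamp-ordering rule for write(X)."""
    if ts_ti < item["R"]:
        return "rejected"              # a younger transaction already read X
    if ts_ti < item["W"]:
        return "rejected, Ti rolled back"
    item["W"] = ts_ti
    return "executed"

x = {"R": 0, "W": 0}   # read/write timestamps of data item X
print(write(5, x))      # executed
print(read(3, x))       # rejected: TS(Ti)=3 < W-timestamp(X)=5
print(read(7, x))       # executed
print(write(6, x))      # rejected: TS(Ti)=6 < R-timestamp(X)=7
```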

Thomas' Write Rule

Under the basic timestamp-ordering rule above, if TS(Ti) < W-timestamp(X), the write operation is rejected and Ti is rolled back. Thomas' write rule modifies this: instead of rolling Ti back, the outdated 'write' operation itself is simply ignored.

This modification makes the schedule view serializable.



  7. Crash Recovery


Transactions (or units of work) against a database can be interrupted unexpectedly. If a failure occurs before all of the changes that are part of the unit of work are completed, committed, and written to disk, the database is left in an inconsistent and unusable state. Crash recovery is the process by which the database is moved back to a consistent and usable state. This is done by rolling back incomplete transactions and completing committed transactions that were still in memory when the crash occurred.

If the database or the database manager fails, the database can be left in an inconsistent state. The contents of the database might include changes made by transactions that were incomplete at the time of failure. The database might also be missing changes that were made by transactions that completed before the failure but which were not yet flushed to disk. A crash recovery operation must be performed in order to roll back the partially completed transactions and to write to disk the changes of completed transactions that were previously made only in memory.

Failure Classification

To see where a problem has occurred, we generalize failures into the following categories −

  • Transaction failure

  • System crash

  • Disk failure

There are two types of techniques, which can help a DBMS in recovering as well as maintaining the atomicity of a transaction −

  • Maintaining the logs of each transaction, and writing them onto some stable storage before actually modifying the database.

  • Maintaining shadow paging, where the changes are done on a volatile memory, and later, the actual database is updated.

Log-based Recovery

A log is a sequence of records that keeps track of the actions performed by a transaction. It is important that the log is written before the actual modification and stored on a stable storage medium, which is failsafe.

Log-based recovery works as follows −

  • The log file is kept on a stable storage media.

  • When a transaction enters the system and starts execution, it writes a log about it.

<Tn, Start>

  • When the transaction modifies an item X, it writes a log record as follows −

<Tn, X, V1, V2>

This denotes that Tn has changed the value of X from V1 to V2.

  • When the transaction finishes, it logs −

<Tn, commit>

The database can be modified using two approaches −

  • Deferred database modification − All logs are written on to the stable storage and the database is updated when a transaction commits.

  • Immediate database modification − Each log follows an actual database modification. That is, the database is modified immediately after every operation.

Recovery with Concurrent Transactions

When more than one transaction is being executed in parallel, the logs are interleaved. At recovery time, it would be hard for the recovery system to backtrack through all the logs and then start recovering. To ease this situation, most modern DBMSs use the concept of 'checkpoints'.

Checkpoint

Keeping and maintaining logs in real time and in a real environment may fill up all the memory space available in the system. As time passes, the log file may grow too big to be handled at all. A checkpoint is a mechanism whereby all the previous logs are removed from the system and stored permanently on disk. A checkpoint declares a point before which the DBMS was in a consistent state and all transactions had been committed.

Recovery

When a system with concurrent transactions crashes and recovers, it behaves in the following manner −


  • The recovery system reads the logs backwards from the end to the last checkpoint.

  • It maintains two lists, an undo-list and a redo-list.

  • If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn, Commit>, it puts the transaction in the redo-list.

  • If the recovery system sees a log with <Tn, Start> but no commit or abort log is found, it puts the transaction in the undo-list.

All the transactions in the undo-list are then undone, and their logs are removed. All the transactions in the redo-list are redone by replaying their logged operations.
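The undo-list/redo-list classification can be sketched as follows (the log representation is illustrative; the backward scan is simplified to membership tests over the log):

```python
def classify(log):
    """Classify logged transactions into an undo-list and a redo-list.

    log: list of (transaction, event) pairs, oldest first.
    """
    committed = {t for t, ev in log if ev == "commit"}
    started = {t for t, ev in log if ev == "start"}
    undo = [t for t in started if t not in committed]   # no commit record: roll back
    redo = [t for t in started if t in committed]       # committed: replay
    return sorted(undo), sorted(redo)

log = [
    ("T1", "start"), ("T1", "commit"),
    ("T2", "start"),                    # no commit record: crashed mid-flight
    ("T3", "start"), ("T3", "commit"),
]
undo, redo = classify(log)
print(undo)  # ['T2']        -> undone
print(redo)  # ['T1', 'T3']  -> redone
```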



  8. Query processing and optimization


The main goal of creating a database is to store the related data at one place, access and manipulate them as and when it is required by the user. Accessing and manipulating the data should be done efficiently i.e.; it should be accessed easily and quickly.

What DBMS does is it asks its users to write query in SQL. It verifies the code written by the user and then converts them into low level languages. It then selects the best execution path and executes the query and gets the data from internal memory. All these processes are together known as query processing.

Query processing is the process of transforming a high-level query into a correct and efficient execution plan expressed in a low-level language.



Query processing consists of several steps, given below:

[Figure: The steps of query processing]

  • Query Parsing and Translation: In this step, the parser checks the syntax of the query. The translator then translates the query into the possible query plans, each represented by a relational algebra expression.

  • Query Optimization: In this step, the query optimizer chooses the best of the possible query plans, taking into account the conditions and indexes available in the system.

  • Query Execution: This is the final step of query processing. The command processor executes the chosen query plan, retrieves the data from the database, and returns the result.
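
These steps can be observed concretely in SQLite, whose EXPLAIN QUERY PLAN statement exposes the plan the optimizer chose (the table, column, and index names below are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany("INSERT INTO emp (name, dept) VALUES (?, ?)",
                 [("Asha", "HR"), ("Ravi", "IT"), ("Mira", "IT")])
conn.execute("CREATE INDEX idx_dept ON emp(dept)")

# Query optimization: inspect the plan the optimizer chose (it should use
# idx_dept for the equality predicate rather than scanning the whole table).
for row in conn.execute("EXPLAIN QUERY PLAN SELECT name FROM emp WHERE dept = 'IT'"):
    print(row)

# Query execution: the command processor runs the plan and returns the result.
print(conn.execute("SELECT name FROM emp WHERE dept = 'IT'").fetchall())
# -> [('Ravi',), ('Mira',)]
```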


  1. Indexing

  2. Distributed database systems and object oriented database system

In a distributed database, there are a number of databases that may be geographically distributed all over the world. A distributed DBMS manages the distributed database in such a manner that it appears as one single database to users. In the later part of the chapter, we go on to study the factors that lead to distributed databases, and their advantages and disadvantages.

A distributed database is a collection of multiple interconnected databases, which are spread physically across various locations that communicate via a computer network.

Features

  • Databases in the collection are logically interrelated with each other. Often they represent a single logical database.

  • Data is physically stored across multiple sites. Data in each site can be managed by a DBMS independent of the other sites.

  • The processors in the sites are connected via a network. They do not have any multiprocessor configuration.

  • A distributed database is not a loosely connected file system.

  • A distributed database incorporates transaction processing, but it is not synonymous with a transaction processing system.

Distributed Database Management System

A distributed database management system (DDBMS) is a centralized software system that manages a distributed database in a manner as if it were all stored in a single location.

Features

  • It is used to create, retrieve, update and delete distributed databases.

  • It synchronizes the database periodically and provides access mechanisms by virtue of which the distribution becomes transparent to the users.

  • It ensures that the data modified at any site is universally updated.

  • It is used in application areas where large volumes of data are processed and accessed by numerous users simultaneously.

  • It is designed for heterogeneous database platforms.

  • It maintains confidentiality and data integrity of the databases.

Factors Encouraging DDBMS

The following factors encourage moving over to DDBMS −

  • Distributed Nature of Organizational Units − Most organizations in the current times are subdivided into multiple units that are physically distributed over the globe. Each unit requires its own set of local data. Thus, the overall database of the organization becomes distributed.

  • Need for Sharing of Data − The multiple organizational units often need to communicate with each other and share their data and resources. This demands common databases or replicated databases that should be used in a synchronized manner.

  • Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) work upon diversified systems which may have common data. Distributed database systems aid both kinds of processing by providing synchronized data.

  • Database Recovery − One of the common techniques used in DDBMS is replication of data across different sites. Replication of data automatically helps in data recovery if database in any site is damaged. Users can access data from other sites while the damaged site is being reconstructed. Thus, database failure may become almost inconspicuous to users.

  • Support for Multiple Application Software − Most organizations use a variety of application software each with its specific database support. DDBMS provides a uniform functionality for using the same data among different platforms.

Advantages of Distributed Databases

Following are the advantages of distributed databases over centralized databases.

Modular Development − If the system needs to be expanded to new locations or new units, in centralized database systems, the action requires substantial efforts and disruption in the existing functioning. However, in distributed databases, the work simply requires adding new computers and local data to the new site and finally connecting them to the distributed system, with no interruption in current functions.

More Reliable − In case of database failures, the total system of centralized databases comes to a halt. However, in distributed systems, when a component fails, the system continues to function, possibly at reduced performance. Hence, a DDBMS is more reliable.

Better Response − If data is distributed in an efficient manner, then user requests can be met from local data itself, thus providing faster response. On the other hand, in centralized systems, all queries have to pass through the central computer for processing, which increases the response time.

Lower Communication Cost − In distributed database systems, if data is located locally where it is mostly used, then the communication costs for data manipulation can be minimized. This is not feasible in centralized systems.

Object Oriented Database (OODB) provides all the facilities associated with object oriented paradigm. It enables us to create classes, organize objects, structure an inheritance hierarchy and call methods of other classes. Besides these, it also provides the facilities associated with standard database systems. However, object oriented database systems have not yet replaced the RDBMS in commercial business applications.

In an object oriented database, the user creates classes, objects, inheritance and so on, and the database system stores and manages these objects and classes. A translation layer is required to map the objects created by the user onto the objects of the database system.



Advantages and Disadvantages of OODBMSs

Advantages of OODBMSs


Enriched modeling capabilities

Extensibility

Capable of handling a large variety of data types

Removal of impedance mismatch

More expressive query language

Support for schema evolution

Support for long-duration transactions

Applicability to advanced database applications

Improved performance


Disadvantages of OODBMSs 

Lack of universal data model

Lack of experience

Lack of standards

Competition

Complexity

Lack of support for views

Lack of support for security





  1. Data mining and data Warehousing


Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data.

Data Mining 

Data mining refers to extracting knowledge from large amounts of data. The data sources can include databases, data warehouses, the web, etc.

Knowledge discovery is an iterative sequence:

  1. Data cleaning – remove noisy and inconsistent data.

  2. Data integration – combine multiple data sources into one.

  3. Data selection – select only the data relevant to the analysis.

  4. Data transformation – transform the data into a form appropriate for mining.

  5. Data mining – apply methods to extract data patterns.

  6. Pattern evaluation – identify the truly interesting patterns in the data.

  7. Knowledge representation – use visualization and knowledge-representation techniques to present the mined knowledge.
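
The steps above can be sketched end-to-end on toy data (all records and thresholds are invented for illustration):

```python
# Toy end-to-end run of the knowledge-discovery steps
# (all records and thresholds are invented for illustration).

raw_a = [{"cust": "c1", "age": 34, "spend": 120},
         {"cust": "c2", "age": -1, "spend": 80},   # inconsistent record
         {"cust": "c3", "age": 45, "spend": 300}]
raw_b = [{"cust": "c4", "age": 29, "spend": 150}]

# 1. Data cleaning: drop inconsistent records (negative age here).
cleaned = [r for r in raw_a if r["age"] >= 0]
# 2. Data integration: combine the two sources.
integrated = cleaned + raw_b
# 3. Data selection: keep only the attributes relevant to the analysis.
selected = [{"cust": r["cust"], "spend": r["spend"]} for r in integrated]
# 4. Data transformation: normalize spend into [0, 1].
max_spend = max(r["spend"] for r in selected)
transformed = [{**r, "spend": r["spend"] / max_spend} for r in selected]
# 5. Data mining: a trivial "pattern" - customers with above-average spend.
mean = sum(r["spend"] for r in transformed) / len(transformed)
pattern = [r["cust"] for r in transformed if r["spend"] > mean]
# 6./7. Pattern evaluation and knowledge representation.
print(pattern)   # ['c3']
```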

What kind of data can be mined?

  1. Database Data

  2. Data Warehouse 

  3. Transactional Data

Scope of Data mining

  1. Automated prediction of trends and behaviours: Data mining automates the process of finding predictive information in large databases. For example, consider a marketing company: data mining uses data about past promotional mailings to identify the targets most likely to maximize the return.

  2. Automated discovery of previously unknown patterns: Data mining sweeps through the database and identifies previously hidden patterns. For example, in a retail store, data mining can go through the entire database and find patterns for the items that are usually bought together.


Understanding a Data Warehouse

  • A data warehouse is a database, which is kept separate from the organization's operational database.

  • There is no frequent updating done in a data warehouse.

  • It possesses consolidated historical data, which helps the organization to analyze its business.

  • A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.

  • Data warehouse systems help in the integration of diversity of application systems.

  • A data warehouse system helps in consolidated historical data analysis.

Why a Data Warehouse is Separated from Operational Databases

A data warehouse is kept separate from operational databases due to the following reasons −

  • An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and present a general form of data.

  • Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

  • An operational database query allows both read and modify operations, while an OLAP query needs only read-only access to the stored data.

  • An operational database maintains current data. On the other hand, a data warehouse maintains historical data.

Data Warehouse Features

The key features of a data warehouse are discussed below −

  • Subject Oriented − A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations, rather it focuses on modelling and analysis of data for decision making.

  • Integrated − A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

  • Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.

  • Non-volatile − Non-volatile means the previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Note − A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications

As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for the enterprise management. Data warehouses are widely used in the following fields −

  • Financial services

  • Banking services

  • Consumer goods

  • Retail sectors

  • Controlled manufacturing

Types of Data Warehouse

Information processing, analytical processing, and data mining are the three types of data warehouse applications that are discussed below −

  • Information Processing − A data warehouse allows the data stored in it to be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

  • Analytical Processing − A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.

  • Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction. These mining results can be presented using the visualization tools.

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
------ | --------------------- | ----------------------------
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on information out. | It focuses on data in.
5 | It is based on the Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
10 | The number of users is in the hundreds. | The number of users is in the thousands.
11 | The number of records accessed is in the millions. | The number of records accessed is in the tens.
12 | The database size ranges from 100 GB to 100 TB. | The database size ranges from 100 MB to 100 GB.
13 | It is highly flexible. | It provides high performance.



  1. Security Management system

Database Security and Threats

Data security is an imperative aspect of any database system. It is of particular importance in distributed systems because of the large number of users, fragmented and replicated data, multiple sites, and distributed control.

Threats in a Database

  • Availability loss − Availability loss refers to non-availability of database objects by legitimate users.

  • Integrity loss − Integrity loss occurs when unacceptable operations are performed upon the database either accidentally or maliciously. This may happen while creating, inserting, updating or deleting data. It results in corrupted data leading to incorrect decisions.

  • Confidentiality loss − Confidentiality loss occurs due to unauthorized or unintentional disclosure of confidential information. It may result in illegal actions, security threats and loss in public confidence.

Measures of Control

The measures of control can be broadly divided into the following categories −

  • Access Control − Access control includes security mechanisms in a database management system to protect against unauthorized access. A user can gain access to the database after clearing the login process through only valid user accounts. Each user account is password protected.

  • Flow Control − Distributed systems encompass a lot of data flow from one site to another and also within a site. Flow control prevents data from being transferred in such a way that it can be accessed by unauthorized agents. A flow policy lists out the channels through which information can flow. It also defines security classes for data as well as transactions.

  • Data Encryption − Data encryption refers to coding the data when sensitive data is to be communicated over public channels. Even if an unauthorized agent gains access to the data, he cannot understand it, since it is in an incomprehensible format.
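
The account-based access control described above can be sketched as follows (the usernames, privileges, and storage scheme are invented for illustration; real DBMSs implement this with GRANT/REVOKE and their own credential stores):

```python
# Toy sketch of account-based access control (usernames, privileges, and the
# storage scheme are invented; not how any real DBMS stores credentials).
import hashlib
import os

accounts = {}   # username -> (salt, password_hash, privileges)

def create_account(user, password, privileges):
    salt = os.urandom(16)
    pw_hash = hashlib.sha256(salt + password.encode()).hexdigest()
    accounts[user] = (salt, pw_hash, set(privileges))

def login(user, password):
    # Re-hash the supplied password with the stored salt and compare.
    salt, pw_hash, _ = accounts[user]
    return hashlib.sha256(salt + password.encode()).hexdigest() == pw_hash

def authorized(user, action):
    return action in accounts[user][2]

create_account("alice", "s3cret", {"SELECT", "INSERT"})
print(login("alice", "s3cret"))        # True  - valid account and password
print(login("alice", "wrong"))         # False - password check fails
print(authorized("alice", "DELETE"))   # False - privilege was never granted
```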

What is Cryptography?

Cryptography is the science of encoding information before sending it over unreliable communication paths, so that only an authorized receiver can decode and use it.

The coded message is called cipher text and the original message is called plain text. The process of converting plain text to cipher text by the sender is called encoding or encryption. The process of converting cipher text to plain text by the receiver is called decoding or decryption.

The entire procedure of communicating using cryptography can be illustrated through the following diagram −

Conventional Encryption Methods

In conventional cryptography, encryption and decryption are done using the same secret key. Here, the sender encrypts the message with an encryption algorithm using a copy of the secret key. The encrypted message is then sent over public communication channels. On receiving the encrypted message, the receiver decrypts it with the corresponding decryption algorithm using the same secret key.
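
The shared-secret-key idea can be illustrated with a deliberately toy cipher (XOR with a repeating key stands in for a real algorithm such as DES; this offers no real security):

```python
# Toy symmetric cipher: XOR with a repeating key, for illustration only.
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # The same function both encrypts and decrypts, since (m ^ k) ^ k == m.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

secret_key = b"shared-secret"                 # known only to sender and receiver
plain = b"transfer 100 to account 42"
cipher = xor_cipher(plain, secret_key)        # sender encrypts
restored = xor_cipher(cipher, secret_key)     # receiver decrypts with the same key
print(restored == plain)    # True
```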

Security in conventional cryptography depends on two factors −

  • A sound algorithm which is known to all.

  • A randomly generated, preferably long secret key known only by the sender and the receiver.

The most famous conventional cryptography algorithm is Data Encryption Standard or DES.

The advantage of this method is its easy applicability. However, the greatest problem of conventional cryptography is sharing the secret key between the communicating parties. The ways to send the key are cumbersome and highly susceptible to eavesdropping.

Public Key Cryptography

In contrast to conventional cryptography, public key cryptography uses two different keys, referred to as the public key and the private key. Each user generates a pair of public and private keys. The user then puts the public key in an accessible place. When a sender wants to send a message, he encrypts it using the public key of the receiver. On receiving the encrypted message, the receiver decrypts it using his private key. Since the private key is not known to anyone but the receiver, no other person who receives the message can decrypt it.

The most popular public key cryptography algorithms are the RSA algorithm and the Diffie–Hellman algorithm. This method is very secure for sending private messages. However, it involves a lot of computation and so proves to be inefficient for long messages.
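
The public/private key asymmetry can be sketched with textbook RSA using tiny hardcoded numbers (real RSA uses very large primes and padding; these values are purely illustrative):

```python
# Textbook RSA with tiny numbers (p=61, q=53), only to show the
# public/private key asymmetry; never usable as real encryption.

n, e, d = 3233, 17, 2753   # receiver's key pair: (n, e) public, d private

message = 65                       # a small integer "message" (must be < n)
cipher = pow(message, e, n)        # sender encrypts with the receiver's PUBLIC key
recovered = pow(cipher, d, n)      # receiver decrypts with his PRIVATE key
print(cipher)                      # 2790 - unintelligible without d
print(recovered == message)        # True
```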

The solution is to use a combination of conventional and public key cryptography. The secret key is encrypted using public key cryptography before being shared between the communicating parties. Then, the message is sent using conventional cryptography with the aid of the shared secret key.

Digital Signatures

A Digital Signature (DS) is an authentication technique based on public key cryptography used in e-commerce applications. It associates a unique mark to an individual within the body of his message. This helps others to authenticate valid senders of messages.

Typically, a user’s digital signature varies from message to message in order to provide security against counterfeiting. The method is as follows −

  • The sender takes a message, calculates the message digest of the message, and signs the digest with a private key.

  • The sender then appends the signed digest along with the plaintext message.

  • The message is sent over communication channel.

  • The receiver removes the appended signed digest and verifies the digest using the corresponding public key.

  • The receiver then takes the plaintext message and runs it through the same message digest algorithm.

  • If the results of the last two steps match, the receiver knows that the message has integrity and is authentic.
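
The digest-then-sign flow above can be sketched using SHA-256 as the message-digest algorithm and the same kind of toy RSA key pair (real digital signatures use full-size keys and standardized padding):

```python
# Sketch of digest-then-sign: SHA-256 digest signed with a toy RSA key pair.
import hashlib

n, e, d = 3233, 17, 2753   # signer's toy key pair; d is the private key

def message_digest(msg: bytes) -> int:
    # Reduce the SHA-256 digest mod n so the toy key can sign it.
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n

msg = b"pay Bob 10"
signature = pow(message_digest(msg), d, n)   # sender signs the digest (private key)

# Receiver recomputes the digest and verifies the signature with the PUBLIC key.
print(pow(signature, e, n) == message_digest(msg))   # True -> intact and authentic
# A tampered message would (with overwhelming probability) yield a different
# digest, so the same check would fail for, say, b"pay Bob 99".
```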
