A. Introduction

Databricks is a web-based platform used for all kinds of data-related operations. It was developed by the creators of Apache Spark. Like Apache Spark itself, Databricks provides web-based cluster management and a notebook-style interface. Working with data requires storage, analysis, model building, and visualisation tools. Because these tools are readily available on the platform and cloud services are easy to integrate, Databricks attracts data analysts, data scientists, and data engineers.

Figure: Databricks support for different services.

B. Diving into Databricks

There are six key Databricks concepts that are important to understand when working with the platform.

1. Workspace:

  • A workspace organises notebooks, libraries, and dashboards within Databricks. Each workspace is isolated from the others.
  • Databricks provides the Databricks CLI, the Databricks REST APIs, and the UI to manage workspaces, as in the sketch below.
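
As a minimal sketch, the snippet below lists workspace objects through the REST API's /api/2.0/workspace/list endpoint. The host, token, and path are placeholder assumptions, not values from this article:

```python
import os
import requests

# Workspace URL and personal access token are assumed to be set in the environment.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<instance>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

# List objects under a workspace path via the Workspace API.
resp = requests.get(
    f"{host}/api/2.0/workspace/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users"},
)
resp.raise_for_status()
for obj in resp.json().get("objects", []):
    print(obj["object_type"], obj["path"])
```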

2. Notebooks:

  • Notebooks contain the code used to run a defined operation.
  • Code can be written in any Spark-supported language (Python, Scala, SQL, R).
  • Multiple users can edit and share notebooks collaboratively; a minimal cell example follows.
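
As a minimal notebook-cell sketch (the `spark` session and `display` helper are provided automatically inside Databricks notebooks; the data here is illustrative):

```python
# Inside a Databricks notebook, `spark` and `display` are predefined.
data = [("alice", 34), ("bob", 29)]
df = spark.createDataFrame(data, schema=["name", "age"])
display(df)  # renders an interactive, shareable table in the notebook UI
```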

3. Tables:

  • Databricks supports managed tables, where Databricks controls both the data files and the metadata, and unmanaged (external) tables, where only the metadata is managed and the data stays at an external location. The sketch below contrasts the two.
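
A minimal sketch of the difference, using Spark SQL from a notebook; the table names and storage path are hypothetical:

```python
# Managed table: Databricks manages both the data files and the metadata.
spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, ts TIMESTAMP)")

# Unmanaged (external) table: only the metadata is managed; the data
# stays at the external location given below (a hypothetical mount path).
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_ext (id INT, ts TIMESTAMP)
    USING delta
    LOCATION '/mnt/raw/events'
""")
```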

4. Jobs:

  • A job is a way to run code on Databricks, either interactively or non-interactively (on a schedule).
  • Jobs can be created and run with the help of the CLI, the UI, or the Jobs API, as sketched below.
  • Jobs can also be monitored through the CLI, UI, and API.
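
As a minimal sketch, the snippet below triggers an existing job through the Jobs API's run-now endpoint; the host, token, and job_id are placeholders:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Trigger a run of an existing job (job_id 1234 is hypothetical).
resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 1234},
)
resp.raise_for_status()
print("Started run:", resp.json()["run_id"])
```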

5. Clusters:

  • A cluster is a set of computational resources used to run data-related operations. Commands can be run on it interactively (sequentially, from notebooks) or as automated jobs.
  • Clusters can be created, terminated, or restarted using the REST API, the CLI, or the UI; a creation sketch follows.
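
A minimal sketch of creating a cluster through the Clusters API; the runtime version and node type vary by cloud and workspace, so the values below are illustrative:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Create a small cluster (name, runtime label, and node type are examples).
resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_name": "demo-cluster",
        "spark_version": "13.3.x-scala2.12",  # example Databricks runtime label
        "node_type_id": "i3.xlarge",          # example AWS node type
        "num_workers": 2,
    },
)
resp.raise_for_status()
print("Cluster id:", resp.json()["cluster_id"])
```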

6. Libraries:

  • Libraries written in Python, Scala, Java, and R are supported on Databricks for running jobs on clusters. Custom libraries can be installed in three ways: as workspace libraries, as cluster-installed libraries, or as notebook-scoped libraries. The sketch below shows the cluster-install route.
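
A minimal sketch of installing a PyPI package onto a running cluster through the Libraries API; the cluster_id is hypothetical:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Attach a PyPI library to a cluster (cluster_id is a placeholder).
resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "0101-123456-abcd123",
        "libraries": [{"pypi": {"package": "scikit-learn"}}],
    },
)
resp.raise_for_status()  # the install endpoint returns an empty body on success
```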

C. Advantages

  • Support for various scripting languages such as Python, Scala, R, and SQL
  • Support for popular ML frameworks such as TensorFlow, PyTorch, and scikit-learn
  • A collaborative workspace that allows teams to work on data together
  • Automated jobs that run applications on a defined schedule or on demand
  • Suitable for small as well as large workloads
  • Easy integration with remote Git repositories
  • Built-in support for data visualisation
  • With MLflow, Databricks supports the end-to-end machine learning lifecycle (see the sketch below)
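
As a minimal sketch of MLflow tracking (MLflow comes preinstalled on Databricks ML runtimes; the parameter and metric values are illustrative):

```python
import mlflow

# Log a parameter and a metric to an MLflow run; on Databricks the run
# is recorded in the workspace's experiment tracking UI.
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.87)
```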

D. Conclusion

As Databricks:

  • handles different types of data,
  • provides solutions for different data operations,
  • offers a collaborative environment for teams, and
  • runs workloads either interactively or on a schedule,

it has become a very popular tool among data analysts, data engineers, and data scientists.


-Mandar Kulkarni