A. Introduction

Databricks is a web-based platform used for all kinds of data-related operations. It was developed by the creators of Apache Spark. Like Apache Spark itself, Databricks provides web-based cluster management and a notebook-style interface. Working with data requires storage, analysis, model building, and visualisation tools. Because these tools are readily available on the platform and cloud services are easy to integrate, Databricks attracts data analysts, data scientists, and data engineers.

Figure: Databricks support for different services.

B. Diving into Databricks

There are six key Databricks concepts that are important to understand when working with the platform.

1. Workspace:

  • A workspace organises notebooks, libraries, and dashboards within Databricks. Each workspace is isolated from the others.
  • Databricks provides the Databricks CLI, the Databricks REST APIs, and the UI to manage workspaces, as in the sketch below.
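
As a minimal sketch, the snippet below lists workspace objects through the REST API's /api/2.0/workspace/list endpoint. The host, token, and path are placeholder assumptions, not values from this article:

```python
import os
import requests

# Workspace URL and personal access token are assumed to be set in the environment.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<instance>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

# List objects under a workspace path via the Workspace API.
resp = requests.get(
    f"{host}/api/2.0/workspace/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users"},
)
resp.raise_for_status()
for obj in resp.json().get("objects", []):
    print(obj["object_type"], obj["path"])
```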

2. Notebooks:

  • Notebooks contain the code used to run a defined operation.
  • Code can be written in any Spark-supported language (Python, Scala, SQL, R).
  • Multiple users can edit and share notebooks collaboratively; a minimal cell example follows.
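
As a minimal notebook-cell sketch (the `spark` session and `display` helper are provided automatically inside Databricks notebooks; the data here is illustrative):

```python
# Inside a Databricks notebook, `spark` and `display` are predefined.
data = [("alice", 34), ("bob", 29)]
df = spark.createDataFrame(data, schema=["name", "age"])
display(df)  # renders an interactive, shareable table in the notebook UI
```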

3. Tables:

  • Databricks supports managed tables, where Databricks controls both the data files and the metadata, and unmanaged (external) tables, where only the metadata is managed and the data stays at an external location. The sketch below contrasts the two.
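
A minimal sketch of the difference, using Spark SQL from a notebook; the table names and storage path are hypothetical:

```python
# Managed table: Databricks manages both the data files and the metadata.
spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, ts TIMESTAMP)")

# Unmanaged (external) table: only the metadata is managed; the data
# stays at the external location given below (a hypothetical mount path).
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_ext (id INT, ts TIMESTAMP)
    USING delta
    LOCATION '/mnt/raw/events'
""")
```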

4. Jobs:

  • A job is a way to run code on Databricks, either interactively or non-interactively (on a schedule).
  • Jobs can be created and run with the help of the CLI, the UI, or the Jobs API, as sketched below.
  • Jobs can also be monitored through the CLI, UI, and API.
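
As a minimal sketch, the snippet below triggers an existing job through the Jobs API's run-now endpoint; the host, token, and job_id are placeholders:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Trigger a run of an existing job (job_id 1234 is hypothetical).
resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 1234},
)
resp.raise_for_status()
print("Started run:", resp.json()["run_id"])
```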

5. Clusters:

  • A cluster is a set of computational resources used to run data-related operations. Commands can be run on it interactively (sequentially, from notebooks) or as automated jobs.
  • Clusters can be created, terminated, or restarted using the REST API, the CLI, or the UI; a creation sketch follows.
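
A minimal sketch of creating a cluster through the Clusters API; the runtime version and node type vary by cloud and workspace, so the values below are illustrative:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Create a small cluster (name, runtime label, and node type are examples).
resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_name": "demo-cluster",
        "spark_version": "13.3.x-scala2.12",  # example Databricks runtime label
        "node_type_id": "i3.xlarge",          # example AWS node type
        "num_workers": 2,
    },
)
resp.raise_for_status()
print("Cluster id:", resp.json()["cluster_id"])
```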

6. Libraries:

  • Libraries written in Python, Scala, Java, and R are supported on Databricks for running jobs on clusters. Custom libraries can be installed in three ways: as workspace libraries, as cluster-installed libraries, or as notebook-scoped libraries. The sketch below shows the cluster-install route.
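
A minimal sketch of installing a PyPI package onto a running cluster through the Libraries API; the cluster_id is hypothetical:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Attach a PyPI library to a cluster (cluster_id is a placeholder).
resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "0101-123456-abcd123",
        "libraries": [{"pypi": {"package": "scikit-learn"}}],
    },
)
resp.raise_for_status()  # the install endpoint returns an empty body on success
```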

C. Advantages

  • Support for various scripting languages such as Python, Scala, R, and SQL
  • Support for popular ML frameworks such as TensorFlow, PyTorch, and scikit-learn
  • A collaborative workspace that allows teams to work on data together
  • Automated jobs that run applications on a defined schedule or on demand
  • Suitable for small as well as large workloads
  • Easy integration with remote Git repositories
  • Built-in support for data visualisation
  • With MLflow, Databricks supports the end-to-end machine learning lifecycle (see the sketch below)
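
As a minimal sketch of MLflow tracking (MLflow comes preinstalled on Databricks ML runtimes; the parameter and metric values are illustrative):

```python
import mlflow

# Log a parameter and a metric to an MLflow run; on Databricks the run
# is recorded in the workspace's experiment tracking UI.
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.87)
```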

D. Conclusion

As Databricks:

  • handles different types of data,
  • provides solutions for different data operations,
  • offers a collaborative environment for teams, and
  • runs workloads either interactively or on a schedule,

it has become a very popular tool among data analysts, data engineers, and data scientists.


-Mandar Kulkarni