Databricks Python Notebook Examples For Success
Hey everyone! If you're diving into the world of big data analytics and machine learning, chances are you've heard of Databricks. It's a seriously powerful platform, and one of its coolest features is the ability to use Python notebooks. But where do you start? What are some Databricks Python notebook examples that can actually help you get things done? Well, you've come to the right place, guys! We're going to break down some awesome examples, from basic data loading to more advanced machine learning tasks, so you can hit the ground running and make the most of this incredible tool. Think of this as your cheat sheet to becoming a Databricks ninja!
Getting Started: Your First Databricks Python Notebook
Alright, let's kick things off with the basics. When you first open a new notebook in Databricks, it's like a blank canvas, right? But with Python, you can transform that canvas into a data powerhouse. One of the most fundamental tasks is loading data. Databricks makes this super easy, especially if your data is already in cloud storage like S3, ADLS, or GCS. For instance, imagine you have a CSV file. You can load it into a Spark DataFrame with just a few lines of Python. This DataFrame is the core data structure in Spark, and it's optimized for distributed processing, which is what Databricks is all about. So, let's say you have a file named my_data.csv in your Databricks File System (DBFS) or an external location. Your code might look something like this:
from pyspark.sql import SparkSession
# Create a SparkSession (usually already available in Databricks notebooks)
spark = SparkSession.builder.appName("BasicDataLoad").getOrCreate()
# Define the path to your CSV file
file_path = "/FileStore/tables/my_data.csv" # Example path in DBFS
# Load the CSV file into a Spark DataFrame
df = spark.read.csv(file_path, header=True, inferSchema=True)
# Display the first few rows of the DataFrame
df.show()
# Print the schema to see data types
df.printSchema()
See? Simple, right? The header=True part tells Spark that the first row is the header, and inferSchema=True tries to guess the data types of your columns. Getting the data types right (integer, string, timestamp, and so on) is essential for efficient processing and analysis. This initial step of data ingestion is the bedrock of any data project. Without it, you can't do anything else! Understanding how to load various file formats – CSV, JSON, Parquet, Avro – is a key skill. Databricks supports them all natively. For example, loading JSON is just as straightforward:
json_path = "/FileStore/tables/my_data.json"
df_json = spark.read.json(json_path)
df_json.show()
And for Parquet, which is a columnar storage format optimized for Spark:
parquet_path = "/FileStore/tables/my_data.parquet" # Example path in DBFS
df_parquet = spark.read.parquet(parquet_path)
df_parquet.show()
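We mentioned Avro above too, and it reads pretty much the same way through the built-in Avro data source that ships with Databricks Runtime. Here's a minimal sketch; the path is just a made-up placeholder, so point it at wherever your Avro files actually live:
# Load Avro files using the built-in Avro data source
# (the path below is a hypothetical example)
avro_path = "/FileStore/tables/my_data.avro"
df_avro = spark.read.format("avro").load(avro_path)
df_avro.show()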
These basic data loading examples are the starting point for countless analytics tasks. They demonstrate the power and simplicity of using Python with Spark DataFrames in the Databricks environment. So, master these, and you're already on your way to becoming a data wizard!
Data Exploration and Transformation with Python
Once you've got your data loaded, the next natural step is to explore and transform it. This is where the real fun begins, and Databricks notebooks, powered by Python, shine. Data exploration involves understanding your dataset: What are the columns? What are the data types? Are there missing values? What are the distributions of your data? Data transformation involves cleaning the data, reshaping it, and preparing it for analysis or machine learning models. Let’s dive into some Databricks Python notebook examples for these tasks.
First, let's look at exploring our df DataFrame loaded earlier. After df.show() and df.printSchema(), you'll often want to get summary statistics. For numerical columns, .describe() is your best friend:
# Display summary statistics for numerical columns
df.describe().show()
This gives you counts, means, standard deviations, minimums, and maximums. Super useful for spotting outliers or understanding the range of your data. For categorical data, you might want to see the distinct values and their frequencies:
# Count occurrences of each unique value in a specific column (e.g., 'Category')
df.groupBy("Category").count().orderBy("count", ascending=False).show()
This tells you which categories are most common in your dataset. Data cleaning is another huge part of exploration. Missing values are a common problem. You can count them per column:
from pyspark.sql.functions import col, count, isnull, when
# Count null values for each column
df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).show()
Once you identify missing values, you need to decide how to handle them. Common strategies include dropping rows with nulls, filling them with a default value (like 0 or the mean), or using more sophisticated imputation techniques. For example, to fill nulls in a specific column, say Age, with the mean age:
from pyspark.sql.functions import mean
mean_age = df.select(mean("Age")).first()[0]
df_filled = df.na.fill(mean_age, subset=["Age"])
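And if you'd rather not impute at all, the same na API covers the other strategies we just mentioned. Here's a quick sketch (the Age column is just illustrative) of dropping rows with nulls and of filling with a constant instead of the mean:
# Drop any row where Age is null
df_dropped = df.na.drop(subset=["Age"])
# Or fill nulls with a constant default value such as 0
df_zero_filled = df.na.fill(0, subset=["Age"])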
Transformation often involves creating new features or modifying existing ones. Let's say you have a timestamp column and want to extract the day of the week. You can use Spark SQL functions or PySpark functions:
from pyspark.sql.functions import dayofweek, date_format, col
# Extract day of the week (1=Sunday, 7=Saturday)
df_transformed = df.withColumn("day_of_week", dayofweek(col("timestamp")))
# Format the timestamp to just show the date
df_transformed = df_transformed.withColumn("date_only", date_format(col("timestamp"), "yyyy-MM-dd"))
df_transformed.select("timestamp", "day_of_week", "date_only").show(5)
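Since we mentioned Spark SQL functions as an alternative, here's a rough equivalent of the same extraction written as SQL expressions via selectExpr (using the same hypothetical timestamp column as above):
# The same transformation expressed with SQL expressions
df_sql = df.selectExpr(
    "timestamp",
    "dayofweek(timestamp) AS day_of_week",
    "date_format(timestamp, 'yyyy-MM-dd') AS date_only"
)
df_sql.show(5)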
Another common transformation is pivoting or unpivoting data. If you have data in a