Chapter 3: Data Handling using Pandas - II
Introduction
Pandas is a Python library used for data manipulation, processing, and analysis. Building on previous DataFrame basics, this chapter introduces advanced features like sorting, aggregating, and handling missing values in data.
Descriptive Statistics
Descriptive statistics summarize a dataset with a few numbers that describe its basic properties, such as its typical value and how widely the values are spread.
Key Statistical Functions:
max(): Finds the highest values in columns.
min(): Finds the lowest values.
sum(): Calculates column-wise totals.
count(): Counts non-null values.
mean(): Calculates the average.
median(): Finds the middle value.
mode(): Finds the most frequently occurring values.
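quantile(): Returns the value at a given quantile (q=0.5, the median, by default); q=0.25, 0.5, and 0.75 give the quartiles.
var(): Finds the variance, a measure of how far values spread from the mean. By default pandas computes the sample variance (ddof=1).
std(): Finds the standard deviation, the square root of the variance.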
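The functions listed above can be tried on a small DataFrame. The following is a minimal sketch; the marks data and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical marks data, invented for illustration only.
df = pd.DataFrame({
    "Maths": [90, 80, 70, 80],
    "Science": [85, 75, 95, 65],
})

highest = df.max()        # column-wise maximum
lowest = df.min()         # column-wise minimum
total = df.sum()          # column-wise totals
non_null = df.count()     # number of non-null values per column
average = df.mean()       # column-wise mean
middle = df.median()      # column-wise median
most_common = df.mode()   # a DataFrame, because a column can have several modes
q1 = df.quantile(0.25)    # first quartile of each column
variance = df.var()       # sample variance (ddof=1) by default
std_dev = df.std()        # sample standard deviation

print(average)
print(q1)
```

Each call except mode() returns a Series with one entry per column. mode() returns a DataFrame because a column may have more than one most frequent value.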
Data Aggregations
Aggregation combines multiple values to return a single output using functions like max(), min(), sum(), count(), std(), and var().
Aggregations can be applied to one or more columns, producing summary statistics.
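One way to apply several aggregations at once is the agg() method of a DataFrame or Series. The sketch below assumes the same made-up marks data; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical marks data, invented for illustration only.
df = pd.DataFrame({
    "Maths": [90, 80, 70, 80],
    "Science": [85, 75, 95, 65],
})

# One function on one column gives a single value.
maths_max = df["Maths"].agg("max")

# Several functions on every column give a DataFrame with one row per function.
summary = df.agg(["max", "min", "sum"])

# A dict applies a different function to each named column.
per_column = df.agg({"Maths": "mean", "Science": "std"})

print(summary)
print(per_column)
```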
Sorting a DataFrame
Sorting arranges data by specified columns, either in ascending or descending order, using sort_values().
Syntax: DataFrame.sort_values(by=[column], axis=0, ascending=True).
Sorting can be performed on multiple columns, with secondary columns used when primary columns have identical values.
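A short sketch of single-column and multi-column sorting follows. The student records are invented for illustration.

```python
import pandas as pd

# Hypothetical student records, invented for illustration only.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chitra", "Dev"],
    "Class": [12, 11, 12, 11],
    "Marks": [78, 91, 85, 67],
})

# Sort on one column, highest marks first.
by_marks = df.sort_values(by="Marks", ascending=False)

# Sort on Class (ascending); within each class, sort on Marks (descending).
by_class_then_marks = df.sort_values(by=["Class", "Marks"],
                                     ascending=[True, False])

print(by_marks)
print(by_class_then_marks)
```

sort_values() returns a new, sorted DataFrame and leaves df unchanged.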
GROUP BY Functions
In pandas, the groupby() function splits a DataFrame into groups based on a criterion, such as the values in a column, and applies a function like sum, mean, or max to each group.
Steps in GROUP BY:
Split: Break data into groups based on a criterion.
Apply: Perform operations like sum or count on each group.
Combine: Merge results back into a new DataFrame.
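The split, apply, and combine steps can be sketched with groupby(). The store sales figures below are made up for illustration.

```python
import pandas as pd

# Hypothetical sales data, invented for illustration only.
df = pd.DataFrame({
    "Store": ["S1", "S2", "S1", "S2", "S1"],
    "Sales": [100, 150, 200, 50, 300],
})

# Split: one group per distinct value of Store.
grouped = df.groupby("Store")

# Apply and combine: total sales per store, returned as a Series
# indexed by Store.
totals = grouped["Sales"].sum()

# Several functions at once give a DataFrame with one row per store.
stats = grouped["Sales"].agg(["sum", "mean", "max"])

print(totals)
print(stats)
```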
Altering the Index
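Indexing allows efficient data access and retrieval. An existing column can be set as the index for better data organization, and the index can be reset after rows have been filtered or removed.
reset_index(): Replaces the current index with the default continuous integer index (0, 1, 2, ...). The old index is kept as a column unless drop=True is passed.
set_index(): Makes an existing column the index.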
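The two index functions can be sketched as follows, assuming a small made-up DataFrame of names and marks.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chitra"],
    "Marks": [78, 91, 85],
})

# Make the Name column the index, then look a row up by its label.
by_name = df.set_index("Name")
ben_marks = by_name.loc["Ben", "Marks"]

# Filtering keeps the original labels, so the index has gaps (here 1 and 2).
top = df[df["Marks"] > 80]

# drop=True discards the old labels and renumbers the rows from 0.
renumbered = top.reset_index(drop=True)

# Without drop=True the old index (Name) comes back as a column.
restored = by_name.reset_index()

print(by_name)
print(renumbered)
```

Both functions return a new DataFrame by default, so df itself is not modified.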
Other DataFrame Operations
Reshaping Data
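Pivot: Reshapes a DataFrame so that the unique values of one column become the row labels and those of another become the column labels. Example: Transforming year-wise sales data into a format with stores as rows and years as columns.
Pivot Table: Similar to pivot, but handles duplicate entries for the same row and column pair by combining them with an aggregate function like sum or mean.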
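The store and year example can be sketched with pivot() and pivot_table(). The sales figures and the extra duplicate row are invented for illustration.

```python
import pandas as pd

# Hypothetical year-wise sales, invented for illustration only.
sales = pd.DataFrame({
    "Store": ["S1", "S1", "S2", "S2"],
    "Year": [2016, 2017, 2016, 2017],
    "Sales": [100, 120, 90, 110],
})

# Stores become rows, years become columns.
wide = sales.pivot(index="Store", columns="Year", values="Sales")

# Add a second entry for store S1 in 2016 to create a duplicate pair.
extra = pd.DataFrame({"Store": ["S1"], "Year": [2016], "Sales": [50]})
with_duplicate = pd.concat([sales, extra], ignore_index=True)

# pivot() would raise a ValueError on the duplicate pair;
# pivot_table() combines the duplicates with the aggregate function.
wide_sum = with_duplicate.pivot_table(index="Store", columns="Year",
                                      values="Sales", aggfunc="sum")

print(wide)
print(wide_sum)
```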
Handling Missing Values
Missing values, which pandas represents as NaN, can distort or weaken the results of data analysis. Methods to address this include:
Dropping Rows: Removes rows with missing data using dropna().
Filling Missing Values: Replaces NaNs with meaningful values using fillna(), which can substitute with averages, zeros, or other custom values.
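Both approaches can be sketched on a small DataFrame with gaps. The data is made up for illustration, and NumPy is used only to write the NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with missing entries, invented for illustration only.
df = pd.DataFrame({
    "Maths": [90.0, np.nan, 70.0],
    "Science": [85.0, 75.0, np.nan],
})

# Drop every row that contains at least one NaN.
dropped = df.dropna()

# Replace every NaN with zero.
zero_filled = df.fillna(0)

# Replace each NaN with the mean of its own column.
mean_filled = df.fillna(df.mean())

print(dropped)
print(mean_filled)
```

Dropping rows loses the valid values in those rows as well, so filling is often preferred when only a few values are missing.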