Chapter 2: Data Handling Using Pandas - I
Introduction to Python Libraries
Python is widely used in data science and analytics due to its extensive libraries designed for efficient data processing.
Primary Libraries for Data Science:
NumPy: Used for numerical computations and working with arrays.
Pandas: A high-level data manipulation tool, providing data structures like Series and DataFrame.
Matplotlib: A visualization library for plotting graphs and charts.
Difference between Pandas and NumPy
Data Types:
NumPy arrays are homogeneous, meaning all elements must be of the same data type.
Pandas DataFrames can contain multiple data types, allowing for more flexible data handling.
Data Manipulation:
Pandas offers higher-level functionality like grouping, merging, and reshaping, which are either limited or unavailable in NumPy.
Tabular Data:
Pandas is optimized for data in rows and columns, making it a better choice for handling structured data.
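The data type difference can be seen in a small sketch. The variable names and sample values here are illustrative only and are not part of the chapter.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing ints and a float upcasts every element to float64.
arr = np.array([1, 2, 3.5])
print(arr.dtype)  # float64

# A DataFrame keeps a separate dtype for each column.
df_mixed = pd.DataFrame({
    'Name': ['John', 'Anna'],
    'Age': [25, 28],
    'Score': [88.5, 92.0],
})
print(df_mixed.dtypes)
```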
Installing Pandas
Install Pandas using the Python package manager with the command:
pip install pandas
Series in Pandas
A Series is a one-dimensional labeled array that can hold data of any type (integers, floats, strings, etc.). Each element in a Series is associated with a label or index.
Creating a Series
From a Scalar Value:
A single value is repeated for every label in the index.
Example:
import pandas as pd
s = pd.Series(5, index=[0, 1, 2])
From a List or Array:
Series can be created from lists, where each element in the list becomes an element in the Series.
Example:
data = [10, 20, 30]
s = pd.Series(data)
From a Dictionary:
The dictionary keys become the index of the Series, and values become the Series data.
Example:
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)
Accessing Elements in a Series
Indexing:
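Use s.iloc[position] to access an element by integer position and s[label] (or s.loc[label]) to access it by label. Plain s[0] as positional access on a Series with non-integer labels is deprecated in recent pandas versions.
Example:
print(s.iloc[0]) # Access by position
print(s['a']) # Access by label
Slicing: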
Allows retrieval of a subset of elements using start:end.
Example:
print(s.iloc[1:3]) # Returns the elements at positions 1 and 2 (the end position is excluded)
Series Attributes
index: Returns the labels (index) of the Series.
values: Returns the Series values as an array.
size: Number of elements in the Series.
dtype: Data type of the Series elements.
empty: Checks if the Series is empty.
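As a rough illustration of the attributes listed above, the following sketch prints each one for a small Series. The name s_attr and its values are made up for the example.

```python
import pandas as pd

s_attr = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s_attr.index)   # the labels 'a', 'b', 'c'
print(s_attr.values)  # the underlying data as a NumPy array
print(s_attr.size)    # 3
print(s_attr.dtype)   # an integer dtype
print(s_attr.empty)   # False
```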
Series Methods
head(n): Returns the first n elements.
tail(n): Returns the last n elements.
count(): Counts non-null values.
sum(): Returns the sum of elements.
mean(): Calculates the average value.
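The methods above can be tried on a Series that contains a missing value, which shows how count(), sum(), and mean() treat NaN by default. The name s_num and its data are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

s_num = pd.Series([10.0, 20.0, np.nan, 40.0])

print(s_num.head(2))   # first two elements
print(s_num.tail(2))   # last two elements
print(s_num.count())   # 3: the NaN is not counted
print(s_num.sum())     # 70.0: NaN is skipped by default
print(s_num.mean())    # 70.0 divided by the 3 non-null values
```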
DataFrame in Pandas
A DataFrame is a two-dimensional data structure, similar to a table with rows and columns.
Creating a DataFrame
From a Dictionary of Lists:
Keys are column names, and values are lists representing column data.
Example:
data = {'Name': ['John', 'Anna'], 'Age': [25, 28]}
df = pd.DataFrame(data)
From a List of Dictionaries:
Each dictionary represents a row, and keys serve as column names.
Example:
data = [{'Name': 'John', 'Age': 25}, {'Name': 'Anna', 'Age': 28}]
df = pd.DataFrame(data)
From a NumPy Array:
A DataFrame can be created directly from a two-dimensional array, with column names supplied through the columns argument.
Example:
import numpy as np
data = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(data, columns=['A', 'B'])
Operations on DataFrames
Adding Columns:
New columns can be added directly by specifying the column name and assigning values.
Example:
df['Salary'] = [50000, 60000]
Deleting Rows/Columns:
Use the drop() function to delete rows or columns by label.
Example:
df.drop('Age', axis=1, inplace=True) # Deletes the 'Age' column
Renaming Columns:
The rename() method allows renaming of column labels.
Example:
df.rename(columns={'Name': 'Employee Name'}, inplace=True)
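The examples in this section delete a column and modify the DataFrame in place. As a complementary sketch, the code below deletes a row by label and shows that, without inplace=True, drop() and rename() return new DataFrames. The name df_ops is an assumption for illustration.

```python
import pandas as pd

df_ops = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [25, 28]})

# drop() with axis=0 (the default) removes rows by index label.
without_first = df_ops.drop(0)

# Without inplace=True, rename() returns a new DataFrame and leaves df_ops unchanged.
renamed = df_ops.rename(columns={'Name': 'Employee Name'})

print(without_first)
print(renamed.columns.tolist())  # ['Employee Name', 'Age']
print(df_ops.columns.tolist())   # ['Name', 'Age']
```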
Accessing DataFrame Elements
Label-based Indexing:
Access specific columns or rows using labels.
Example:
df['Name'] # Accesses the 'Name' column
Boolean Indexing:
Filter rows based on conditions.
Example:
df[df['Age'] > 25] # Rows where Age > 25
Slicing:
Use slicing for subsets of rows and columns.
Example:
df.loc[0:1, ['Name', 'Age']] # Rows 0 to 1, only 'Name' and 'Age' columns
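One detail worth seeing side by side is that loc slices by label and includes the end label, while iloc slices by position and excludes the end position. The sketch below assumes a three-row DataFrame named df_sel, and the third row is sample data added for the example.

```python
import pandas as pd

df_sel = pd.DataFrame({'Name': ['John', 'Anna', 'Ravi'], 'Age': [25, 28, 31]})

by_label = df_sel.loc[0:1, ['Name', 'Age']]  # label slice: rows 0 and 1 (end included)
by_position = df_sel.iloc[0:1, 0:2]          # position slice: row 0 only (end excluded)
older = df_sel[df_sel['Age'] > 25]           # boolean mask keeps the rows with Age above 25

print(len(by_label))     # 2
print(len(by_position))  # 1
print(older['Name'].tolist())  # ['Anna', 'Ravi']
```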
Joining, Merging, and Concatenation
Appending Data:
Use pd.concat() to add the rows of one DataFrame to another. The older DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0.
Example:
pd.concat([df1, df2], ignore_index=True)
Merging:
Combines data from different DataFrames based on common columns or indexes.
Concatenation:
Joins multiple DataFrames along a particular axis (row-wise or column-wise).
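Merging and concatenation can be sketched with two small tables. The DataFrames employees and salaries, the EmpID key column, and all values are hypothetical and chosen only for this example.

```python
import pandas as pd

employees = pd.DataFrame({'EmpID': [1, 2, 3], 'Name': ['John', 'Anna', 'Ravi']})
salaries = pd.DataFrame({'EmpID': [1, 2], 'Salary': [50000, 60000]})

# Merge on the shared 'EmpID' column; the default inner join keeps only matching keys.
merged = pd.merge(employees, salaries, on='EmpID')

# Concatenate row-wise (axis=0), renumbering the index.
stacked = pd.concat([employees, employees], axis=0, ignore_index=True)

# Concatenate column-wise (axis=1); rows are aligned on the index, so row 2 gets NaN for Salary.
side_by_side = pd.concat([employees, salaries[['Salary']]], axis=1)

print(merged)
print(stacked.shape)       # (6, 2)
print(side_by_side.shape)  # (3, 3)
```

Passing how='left', how='right', or how='outer' to pd.merge() changes which unmatched keys are kept.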
DataFrame Attributes
index: Lists row labels.
columns: Lists column labels.
dtypes: Data types of each column.
shape: Returns the DataFrame’s dimensions as a (rows, columns) tuple.
values: Returns data in the DataFrame as a NumPy array.
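A short sketch of the attributes listed above, using an illustrative DataFrame named df_attr:

```python
import pandas as pd

df_attr = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [25, 28]})

print(df_attr.index)    # row labels 0 and 1
print(df_attr.columns)  # column labels 'Name' and 'Age'
print(df_attr.dtypes)   # one dtype per column
print(df_attr.shape)    # (2, 2)
print(df_attr.values)   # the data as a two-dimensional NumPy array
```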
Importing and Exporting Data between CSV Files and DataFrames
Importing Data:
read_csv(): Reads data from a CSV file into a DataFrame.
Example:
df = pd.read_csv('data.csv')
Exporting Data:
to_csv(): Exports DataFrame contents to a CSV file.
Example:
df.to_csv('output.csv', index=False)
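The export and import steps can be checked as a round trip without touching the disk. This sketch assumes an in-memory buffer (io.StringIO) in place of a real file, and the names df_out, csv_text, and df_in are illustrative.

```python
import io
import pandas as pd

df_out = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [25, 28]})

# to_csv() returns the CSV text as a string when no path is given.
# index=False leaves the row labels out of the output.
csv_text = df_out.to_csv(index=False)
print(csv_text)

# read_csv() accepts any file-like object, so StringIO stands in for a file on disk.
df_in = pd.read_csv(io.StringIO(csv_text))
print(df_in)
```

If index=False is omitted when writing, the row labels are saved as an extra unnamed column, which then reappears as a data column on the next read unless index_col=0 is passed to read_csv().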
Pandas Series vs NumPy ndarray
Series:
Can hold elements of different types (stored with the object dtype) and can use non-numeric index labels.
Allows automatic alignment by index labels, which is useful for data manipulation.
ndarray:
A NumPy array with fixed-size elements of the same type.
Optimized for mathematical operations but lacks the flexible indexing available in Series.
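The alignment behaviour is easiest to see when two Series with partly different labels are added. The following is a minimal sketch with made-up names and values.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Series addition matches values by label; labels present in only one Series give NaN.
aligned = s1 + s2
print(aligned)

# ndarray addition is purely positional.
positional = np.array([1, 2, 3]) + np.array([10, 20, 30])
print(positional)  # [11 22 33]
```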