Chapter 2: Data Handling Using Pandas - I: || Informatics Practices (IP) || Class 12th || NCERT CBSE || NOTES IN ENGLISH || 2024-25

Introduction to Python Libraries

Python is widely used in data science and analytics due to its extensive libraries designed for efficient data processing.
Primary Libraries for Data Science:

NumPy: Used for numerical computations and working with arrays.
Pandas: A high-level data manipulation tool, providing data structures like Series and DataFrame.
Matplotlib: A visualization library for plotting graphs and charts.

Difference between Pandas and NumPy

Data Types:

NumPy arrays are homogeneous, meaning all elements must be of the same data type.
Pandas DataFrames can contain multiple data types, allowing for more flexible data handling.
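
The difference in data types can be checked directly. The following is a minimal sketch; the variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype for all of its elements.
arr = np.array([1, 2, 3])

# Mixing numbers and text forces every element to a common string type.
mixed = np.array([1, "two", 3.0])

# A DataFrame keeps a separate dtype for each column.
df_types = pd.DataFrame({
    "Name": ["John", "Anna"],
    "Age": [25, 28],
    "Score": [88.5, 92.0],
})

print(arr.dtype)        # one integer dtype for the whole array
print(mixed.dtype)      # one string dtype for the whole array
print(df_types.dtypes)  # one dtype per column
```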

Data Manipulation:

Pandas offers higher-level functionality like grouping, merging, and reshaping, which are either limited or unavailable in NumPy.

Tabular Data:

Pandas is optimized for data in rows and columns, making it a better choice for handling structured data.
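
As an illustration of the higher-level functionality mentioned above, grouping rows of tabular data takes a single call in Pandas. This is only a sketch; the sales table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales records in rows and columns.
sales = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "Amount": [100, 300, 150, 50],
})

# Group the rows by city and add up the amounts in each group.
totals = sales.groupby("City")["Amount"].sum()
print(totals)
```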

Installing Pandas

Install Pandas with pip, the Python package manager, by running this command in a terminal or command prompt:
pip install pandas

Series in Pandas

A Series is a one-dimensional labeled array that can hold data of any type (integers, floats, strings, etc.). Each element in a Series is associated with a label or index.

Creating a Series

From a Scalar Value:

A single value is repeated for every label in the given index.
Example:
import pandas as pd
s = pd.Series(5, index=[0, 1, 2]) # Every label gets the value 5

From a List or Array:

Series can be created from lists, where each element in the list becomes an element in the Series.
Example:
data = [10, 20, 30]
s = pd.Series(data)

From a Dictionary:

The dictionary keys become the index of the Series, and values become the Series data.
Example:
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)

Accessing Elements in a Series

Indexing:

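Use s[label] to access an element by its index label and s.iloc[position] to access it by integer position. Writing s[0] for positional access on a Series with non-integer labels is deprecated in recent Pandas versions.
Example:
print(s.iloc[0]) # Access by position
print(s['a']) # Access by label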

Slicing:

Use s[start:end] to retrieve a subset of elements. With integer positions, the element at end is not included.

Example:
print(s[1:3]) # Returns the elements at positions 1 and 2
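
Slicing by position and slicing by label treat the end point differently. The sketch below uses iloc and loc to make the choice explicit; the Series and its labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Positional slice: the end position (3) is excluded.
by_position = s.iloc[1:3]

# Label slice: the end label ('d') is included.
by_label = s.loc["b":"d"]

print(by_position.tolist())  # [20, 30]
print(by_label.tolist())     # [20, 30, 40]
```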

Series Attributes

index: Returns the labels (index) of the Series.
values: Returns the Series values as an array.
size: Number of elements in the Series.
dtype: Data type of the Series elements.
empty: Checks if the Series is empty.
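
The attributes listed above can be tried on a small Series. This is a minimal sketch with made-up values; note that attributes are written without parentheses.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(list(s.index))  # ['a', 'b', 'c']
print(s.values)       # the data as a NumPy array
print(s.size)         # 3
print(s.dtype)        # the data type shared by the elements
print(s.empty)        # False

# A Series with no elements reports empty as True.
nothing = pd.Series([], dtype="float64")
print(nothing.empty)  # True
```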

Series Methods

head(n): Returns the first n elements (5 if n is omitted).
tail(n): Returns the last n elements (5 if n is omitted).
count(): Counts non-null values.
sum(): Returns the sum of elements.
mean(): Calculates the average value.
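
A short sketch of these methods on a Series that contains one missing value; the marks are made-up sample data. The missing value is skipped by count(), sum(), and mean().

```python
import pandas as pd

# None is stored as NaN (a missing value).
marks = pd.Series([45, 67, None, 89, 72])

print(marks.head(2))  # first two elements
print(marks.tail(2))  # last two elements
print(marks.count())  # 4, the missing value is not counted
print(marks.sum())    # 273.0
print(marks.mean())   # 68.25
```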

DataFrame in Pandas

A DataFrame is a two-dimensional data structure, similar to a table with rows and columns.

Creating a DataFrame

From a Dictionary of Lists:

Keys are column names, and values are lists representing column data.
Example:
data = {'Name': ['John', 'Anna'], 'Age': [25, 28]}
df = pd.DataFrame(data)

From a List of Dictionaries:

Each dictionary represents a row, and keys serve as column names.
Example:
data = [{'Name': 'John', 'Age': 25}, {'Name': 'Anna', 'Age': 28}]
df = pd.DataFrame(data)

From a NumPy Array:

Directly creating DataFrame from arrays with specified column names.
Example:
import numpy as np
data = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(data, columns=['A', 'B'])

Operations on DataFrames

Adding Columns:

New columns can be added directly by specifying the column name and assigning values.

Example:
df['Salary'] = [50000, 60000]

Deleting Rows/Columns:

Use the drop() function to delete rows or columns by label.

Example:
df.drop('Age', axis=1, inplace=True) # Deletes the 'Age' column
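
The same drop() function also deletes rows. The sketch below uses a hypothetical DataFrame named df_demo and leaves out inplace=True, so each call returns a new DataFrame and the original stays unchanged.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["John", "Anna", "Ravi"], "Age": [25, 28, 31]})

# Delete the row whose index label is 0 (axis=0 is the default).
without_first = df_demo.drop(0)

# Delete a column by naming it with the columns keyword.
without_age = df_demo.drop(columns="Age")

print(without_first)
print(without_age)
```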

Renaming Columns:

The rename() method allows renaming of column labels.

Example:
df.rename(columns={'Name': 'Employee Name'}, inplace=True)

Accessing DataFrame Elements

Label-based Indexing:

Access specific columns or rows using labels.

Example:
df['Name'] # Accesses the 'Name' column

Boolean Indexing:

Filter rows based on conditions.

Example:
df[df['Age'] > 25] # Rows where Age > 25
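
Several conditions can be combined in one filter. In this sketch (the people DataFrame is hypothetical), each condition is wrapped in parentheses and joined with & for "and"; | would be used for "or".

```python
import pandas as pd

people = pd.DataFrame({"Name": ["John", "Anna", "Ravi"], "Age": [25, 28, 31]})

# A boolean Series: True for rows where both conditions hold.
mask = (people["Age"] > 25) & (people["Age"] < 30)

# Keep only the rows where the mask is True.
selected = people[mask]
print(selected)
```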

Slicing:

Use loc with a row slice and a list of column labels to select a subset of rows and columns. With loc, both ends of the slice are included.

Example:
df.loc[0:1, ['Name', 'Age']] # Rows labeled 0 to 1 (inclusive), only 'Name' and 'Age' columns
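
For comparison, iloc selects by integer position instead of label and excludes the end of the slice. This sketch assumes a hypothetical DataFrame with text row labels so that the two styles are easy to tell apart.

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["John", "Anna", "Ravi"], "Age": [25, 28, 31], "City": ["Delhi", "Pune", "Agra"]},
    index=["r1", "r2", "r3"],
)

# Label-based: rows 'r1' through 'r2' (end included), named columns.
by_label = people.loc["r1":"r2", ["Name", "Age"]]

# Position-based: rows 0 and 1, columns 0 and 1 (end excluded).
by_position = people.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```

Both selections return the same two rows and two columns.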

Joining, Merging, and Concatenation

Appending Data:

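Use pd.concat() to add the rows of one DataFrame to another. The older DataFrame.append() method was deprecated in Pandas 1.4 and removed in Pandas 2.0.

Example:
pd.concat([df1, df2], ignore_index=True) # df1 and df2 are existing DataFrames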

Merging:

Combines data from different DataFrames based on common columns or indexes.
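
A minimal sketch of merging on a common column, using pd.merge(). The two tables and the EmpID column are hypothetical sample data.

```python
import pandas as pd

employees = pd.DataFrame({"EmpID": [1, 2, 3], "Name": ["John", "Anna", "Ravi"]})
salaries = pd.DataFrame({"EmpID": [1, 2, 4], "Salary": [50000, 60000, 55000]})

# By default merge() performs an inner join: only EmpID values
# present in both DataFrames are kept.
merged = pd.merge(employees, salaries, on="EmpID")
print(merged)
```

Passing how='left', how='right', or how='outer' keeps unmatched rows from one or both DataFrames, with missing values filled in as NaN.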

Concatenation:

Joins multiple DataFrames along a particular axis (row-wise or column-wise).
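
The two directions of concatenation can be sketched with pd.concat(). The DataFrames below are made-up examples.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John"], "Age": [25]})
df2 = pd.DataFrame({"Name": ["Anna"], "Age": [28]})

# Row-wise (axis=0, the default): stack df2 below df1.
# ignore_index=True renumbers the rows 0, 1, ...
rows = pd.concat([df1, df2], ignore_index=True)

# Column-wise (axis=1): place a new column block beside the rows.
extra = pd.DataFrame({"City": ["Delhi", "Pune"]})
cols = pd.concat([rows, extra], axis=1)

print(rows)
print(cols)
```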

DataFrame Attributes

index: Lists row labels.
columns: Lists column labels.
dtypes: Data types of each column.
shape: Returns the DataFrame's dimensions as a (rows, columns) tuple.
values: Returns data in the DataFrame as a NumPy array.
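
A short sketch of these attributes on a small, hypothetical DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["John", "Anna"], "Age": [25, 28]})

print(df_demo.index.tolist())    # [0, 1]
print(df_demo.columns.tolist())  # ['Name', 'Age']
print(df_demo.dtypes)            # one dtype per column
print(df_demo.shape)             # (2, 2)
print(df_demo.values)            # the data as a 2-D NumPy array
```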

Importing and Exporting Data between CSV Files and DataFrames

Importing Data:

read_csv(): Reads data from a CSV file into a DataFrame.

Example:
df = pd.read_csv('data.csv')

Exporting Data:

to_csv(): Exports DataFrame contents to a CSV file.

Example:
df.to_csv('output.csv', index=False)
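
Exporting and importing can be tried together as a round trip. This sketch writes to a temporary folder so that it does not depend on any existing file; the file name people.csv and the sample data are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"Name": ["John", "Anna"], "Age": [25, 28]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "people.csv")
    # index=False leaves the row labels out of the file.
    original.to_csv(path, index=False)
    # Read the same file back into a new DataFrame.
    restored = pd.read_csv(path)

print(restored)
```

If index=False is omitted, the row labels are written as an extra first column, which read_csv() then loads as a column named 'Unnamed: 0'.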

Pandas Series vs NumPy ndarray

Series:

Can hold elements of different types (stored with the object dtype) and can use non-numeric index labels.
Allows automatic alignment by index labels, which is useful for data manipulation.

ndarray:

A NumPy array with fixed-size elements of the same type.
Optimized for mathematical operations but lacks the flexible indexing available in Series.
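
The effect of index alignment can be seen by adding two Series whose labels are in a different order, and comparing the result with plain arrays. The values and labels in this sketch are made up for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Series are matched by label: x = 1 + 30, y = 2 + 20, z = 3 + 10.
series_sum = a + b

# ndarrays are matched by position only.
array_sum = np.array([1, 2, 3]) + np.array([10, 20, 30])

print(series_sum)
print(array_sum)  # [11 22 33]
```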