Pandas
Data cleaning and preprocessing: Pandas provides functions for handling missing values (isnull, dropna, fillna), filtering and sorting rows, and transforming columns.
Data visualization: Pandas integrates with Matplotlib through DataFrame.plot, and its DataFrames can be passed directly to Seaborn, to create charts and graphs.
Merging and joining datasets: Pandas can combine datasets on a common column or index with merge, join, and concat, which makes it easy to bring together data from multiple sources.
Reshaping and pivoting data: Pandas can reshape data by stacking and unstacking index levels, melting wide tables into long ones, and building pivot tables.
Time series analysis: Pandas supports datetime indexes and time series operations such as resampling to a different frequency, shifting (lagging) values, and computing rolling windows.
Data aggregation and groupby operations: Pandas can group rows by one or more keys with groupby and compute summary statistics such as sums, means, and counts for each group.
Working with different data formats: Pandas can read and write data in formats such as CSV, Excel, and SQL databases, through paired readers and writers such as read_csv and to_csv.
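
The walkthrough script does not exercise merging or pivoting, so here is a minimal sketch of both. The two small tables, their column names (quarter, region), and all values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical order records (illustrative values only)
orders = pd.DataFrame({
    'customer_id': [1001, 1001, 1002, 1003],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'order_value': [120.0, 80.0, 50.0, 200.0],
})

# Hypothetical customer lookup table sharing the 'customer_id' column
customers = pd.DataFrame({
    'customer_id': [1001, 1002, 1003],
    'region': ['North', 'South', 'North'],
})

# Merge: attach each customer's region to their orders via the common column
merged = orders.merge(customers, on='customer_id', how='left')

# Pivot: one row per region, one column per quarter, summing order values
pivot = merged.pivot_table(index='region', columns='quarter',
                           values='order_value', aggfunc='sum', fill_value=0)

print(merged)
print(pivot)
```

A left merge keeps every order even if a customer were missing from the lookup table, which is usually the safer default when the orders are the primary data.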
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
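# Generate a customer ID for each of 1000 orders, drawn from 100 distinct IDs (1001 to 1100)
customer_ids = np.random.randint(1001, 1101, size=1000)
# Generate a random value for each order (normal distribution, mean 100, standard deviation 50)
order_values = np.random.normal(100, 50, size=1000)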
# Create a DataFrame with customer IDs and order values
orders_df = pd.DataFrame({'customer_id': customer_ids, 'order_value': order_values})
# Save the DataFrame as a CSV file in the current working directory
orders_df.to_csv('orders.csv', index=False)
# Read the dataset back from the CSV file
orders_df = pd.read_csv('orders.csv')
# Display the first 5 rows of the dataset
print(orders_df.head())
# Check the shape of the dataset
print('shape:', orders_df.shape)
# Check for missing values in each column
print('missing values per column:')
print(orders_df.isnull().sum())
# Drop rows with missing values
orders_df = orders_df.dropna()
# Group the orders by customer ID and calculate the total order value
customer_orders = orders_df.groupby('customer_id')['order_value'].sum()
# Sort the customer orders in descending order
sorted_orders = customer_orders.sort_values(ascending=False)
# Calculate the average order value
average_order_value = orders_df['order_value'].mean()
# Keep the customers whose total spend exceeds the average order value
high_spending_customers = customer_orders[customer_orders > average_order_value]
print(high_spending_customers.head())
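# Select the 10 customers with the highest total order value
top_10_customers = sorted_orders.head(10)
# Plot the totals as a bar chart; IDs are cast to strings so the bars are evenly spaced
plt.bar(top_10_customers.index.astype(str), top_10_customers.values)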
plt.xlabel('Customer ID')
plt.ylabel('Total Order Value')
plt.title('Top 10 High-Spending Customers')
plt.show()
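
The script above has no dates, so it does not touch the time series features from the feature list. A minimal sketch of resampling and shifting follows. The dates and daily totals are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical daily order totals over six consecutive days (illustrative values only)
dates = pd.date_range('2023-01-01', periods=6, freq='D')
daily = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=dates)

# Resample: aggregate the daily values into 2-day bins and sum each bin
two_day_totals = daily.resample('2D').sum()

# Shift: lag the series by one day so each value lines up with the previous day's value
previous_day = daily.shift(1)
day_over_day_change = daily - previous_day

print(two_day_totals)
print(day_over_day_change)
```

The first entry of day_over_day_change is NaN because the first day has no previous day to compare against.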