In data science, knowing how to calculate percentiles is essential for analyzing data distributions, detecting outliers, and making data-driven decisions. Whether you're working with Python or R, calculating percentiles is a fundamental skill. In this guide, we’ll explore 10 different methods for calculating percentiles in Python, ranging from simple approaches using popular libraries like NumPy and Pandas to more advanced custom solutions. Whether you're a beginner or an experienced data scientist, this blog will help you master percentile calculations and discover which method best fits your data needs.
1. Using NumPy
NumPy provides a straightforward way to calculate percentiles using the np.percentile() function. It takes an array of values and the desired percentile (or a list of percentiles), expressed on a scale from 0 to 100. By default it interpolates linearly when the requested percentile falls between two data points. It is widely used for its efficiency and simplicity, especially when working with large datasets stored in NumPy arrays, which makes it a good fit for numerical and scientific computing.
import numpy as np
import pandas as pd
df = pd.DataFrame({'values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
# 50th_percentile (median)
percentile_50 = np.percentile(df['values'], 50)
# 75th and 90th percentile
percentile_75_90 = np.percentile(df['values'], [75, 90])
print(percentile_50, percentile_75_90)
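np.percentile() also works on multi-dimensional arrays through its axis argument. The sketch below uses a small made-up matrix (the name matrix and its values are illustrative only) to show percentiles computed per column and per row.

```python
import numpy as np

# Hypothetical 2-D data: 3 rows (observations) x 2 columns (features)
matrix = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])

# axis=0 computes the percentile down each column
col_medians = np.percentile(matrix, 50, axis=0)

# axis=1 computes the percentile across each row
row_medians = np.percentile(matrix, 50, axis=1)

print(col_medians)  # one value per column
print(row_medians)  # one value per row
```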
2. Using pandas quantile()
Pandas provides a powerful and intuitive method for percentile calculation through the quantile() function. Unlike NumPy's approach, this method works directly on Pandas Series or DataFrames, and the percentile is represented as a decimal between 0 and 1 (e.g., 0.5 for the median). This method is particularly useful when working with tabular data and allows for seamless integration with other Pandas operations like grouping and aggregation.
# A variable name cannot start with a digit, so use percentile_50 rather than 50th_percentile
percentile_50 = df['values'].quantile(0.5) # 50th percentile (same as median)
percentile_75_90 = df['values'].quantile([0.75, 0.90]) # 75th and 90th percentile
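When the requested quantile falls between two observations, quantile() interpolates linearly by default. The interpolation argument changes that behavior. The following is a minimal sketch on the same ten values; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# Default: linear interpolation between the two neighboring observations
q75_linear = values.quantile(0.75)

# "lower" and "higher" return an actual observation instead of an interpolated value
q75_lower = values.quantile(0.75, interpolation="lower")
q75_higher = values.quantile(0.75, interpolation="higher")

print(q75_linear, q75_lower, q75_higher)
```

Choosing "lower" or "higher" can be useful when the result must be a value that really occurs in the data.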
3. Using SciPy
The scipy.stats.scoreatpercentile() function in SciPy is another way to calculate percentiles. It sorts the data and interpolates when the requested percentile falls between two observations. Beyond the percentile itself, it accepts a limit argument to ignore values outside a given range and an interpolation_method argument ('fraction', 'lower', or 'higher') to control how in-between values are resolved. Keep in mind that the SciPy documentation describes this function as one that will become obsolete and points to np.percentile() for new code, so it is mostly relevant when you maintain existing SciPy-based code.
from scipy import stats
percentile_50 = stats.scoreatpercentile(df['values'], 50)
percentile_90 = stats.scoreatpercentile(df['values'], 90)
print(percentile_50, percentile_90)
Note: Like numpy.percentile(), scoreatpercentile() also accepts a sequence of percentiles, for example stats.scoreatpercentile(df['values'], [50, 90]), and returns an array with one value per percentile. The separate calls above are shown only for clarity.
4. Using a Custom Function
For more flexibility, you can implement a custom function to calculate percentiles. This approach involves sorting the dataset and applying a formula to determine the appropriate value for the given percentile. It’s beneficial when the default methods (like np.percentile() or quantile()) do not meet specific needs, such as custom interpolation or other statistical requirements that are not directly available in libraries.
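def calculate_percentile(data, percentile):
    data_sorted = sorted(data)
    # Pick the element at the given rank; clamp so that percentile=100 stays in range
    index = min(int(len(data_sorted) * percentile / 100), len(data_sorted) - 1)
    return data_sorted[index]
# No interpolation is applied, so results can differ from np.percentile()
calculate_percentile(df['values'], 50)
calculate_percentile(df['values'], 90)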
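If you want a custom function that interpolates between neighboring values, the sketch below shows one way to do it. It assumes the same convention as NumPy's default "linear" method, where the fractional position is (n - 1) * percentile / 100; the function name percentile_linear is made up for this example.

```python
import math


def percentile_linear(data, percentile):
    """Percentile with linear interpolation between the two closest ranks."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    data_sorted = sorted(data)
    if not data_sorted:
        raise ValueError("data must not be empty")

    # Fractional 0-based position of the percentile in the sorted data
    position = (len(data_sorted) - 1) * percentile / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower

    # Move from the lower neighbor toward the upper one by the fractional part
    return data_sorted[lower] + (data_sorted[upper] - data_sorted[lower]) * fraction


sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
p50 = percentile_linear(sample, 50)
p90 = percentile_linear(sample, 90)
print(p50, p90)
```

Because the position is computed in floating point, compare results with a small tolerance rather than with exact equality.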
5. Using agg() to Compute Percentiles
The agg() function in Pandas applies one or more aggregation functions to a Series or DataFrame in a single step. This is particularly useful when you need to compute several statistics, including percentiles, at once. By passing np.percentile within agg(), you can calculate multiple percentiles in one call. agg() also supports lambda functions and NumPy functions.
import pandas as pd
import numpy as np
df5 = pd.DataFrame({'values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
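# Calculate multiple percentiles using agg() with a dict of {label: function}
# (a list of (name, function) tuples is only understood by groupby().agg())
# .to_numpy() makes each lambda work on the whole column; some pandas versions
# otherwise try the function on individual elements first
percentiles = df5['values'].agg({
    '25th_percentile': lambda x: np.percentile(x.to_numpy(), 25),
    '50th_percentile': lambda x: np.percentile(x.to_numpy(), 50), # Median
    '75th_percentile': lambda x: np.percentile(x.to_numpy(), 75)
})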
print(percentiles)
6. Using groupby().agg() for Percentiles Within Groups
The groupby().agg() combination in Pandas is a powerful technique to calculate percentiles within specific groups of data. By grouping data based on certain columns and applying the agg() function with a percentile calculation, you can easily compute percentiles per group. This method is especially effective when you have categorical data and want to compute percentiles across different subgroups in a DataFrame, such as calculating percentiles for sales per region.
df5['group'] = ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C']
percentiles_grouped = df5.groupby('group')['values'].agg([
    ('25th_percentile', lambda x: np.percentile(x, 25)),
    ('50th_percentile', lambda x: np.percentile(x, 50)),
    ('75th_percentile', lambda x: np.percentile(x, 75))
])
print(percentiles_grouped)
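As a point of comparison, grouped objects also have their own quantile() method, so the same table can be produced without agg() or lambdas. The sketch below is an alternative to the agg() approach, not a replacement for it; the DataFrame name sales is made up, and the data mirrors the example above.

```python
import pandas as pd

sales = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "C", "C", "C", "C", "C"],
    "values": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# quantile() on a grouped column accepts a list of fractions;
# unstack() turns the quantile level into columns (0.25, 0.5, 0.75)
grouped_quantiles = (
    sales.groupby("group")["values"]
    .quantile([0.25, 0.50, 0.75])
    .unstack()
)
print(grouped_quantiles)
```

The result has one row per group and one column per requested quantile.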
7. Alternative with quantile() Inside agg() Instead of np.percentile()
An alternative to using np.percentile() is integrating the quantile() function inside the agg() function. This approach allows you to compute percentiles directly within a grouped operation without relying on NumPy. By using quantile() in combination with agg(), you can handle the percentile calculation in a more Pandas-centric way, which may improve readability and simplify the overall syntax for data analysis.
Instead of np.percentile(), you can use .quantile():
# On a Series, pass a dict of {label: function}
df5['values'].agg({
    '25th_percentile': lambda x: x.quantile(0.25),
    '50th_percentile': lambda x: x.quantile(0.50),
    '75th_percentile': lambda x: x.quantile(0.75)
})
8. Handling Multiple Percentiles at Once: np.percentile() and .quantile() both support multiple quantiles.
np.percentile() calculates several percentiles in one go when you pass a list of percentile values, and .quantile() does the same with a list of fractions between 0 and 1. This is ideal when you need, for example, the 25th, 50th, and 75th percentiles without making multiple function calls, and want to see how the data is distributed across them at a glance.
percentiles_np = np.percentile(df['values'], [25, 50, 75])
percentiles_pd = df['values'].quantile([0.25, 0.5, 0.75])
print(percentiles_np) # NumPy array output
print(percentiles_pd) # Pandas Series output
Feature | np.percentile() | .quantile() |
---|---|---|
Input format | 0-100 | 0-1 |
Works with | NumPy arrays | Pandas DataFrames/Series |
NaN Handling | Doesn't ignore NaNs (use nanpercentile()) | Ignores NaNs by default |
Multiple Percentiles | ✅ Yes (pass a list) | ✅ Yes (pass a list) |
Performance | Faster for large NumPy arrays | Optimized for Pandas |
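The NaN handling row is the difference most likely to cause surprises, so here is a small sketch of it. The Series with_nan is made-up sample data.

```python
import numpy as np
import pandas as pd

with_nan = pd.Series([10.0, 20.0, np.nan, 40.0, 50.0])

# np.percentile() propagates the NaN into the result
np_result = np.percentile(with_nan.to_numpy(), 50)

# np.nanpercentile() skips NaN values
np_nan_result = np.nanpercentile(with_nan.to_numpy(), 50)

# Series.quantile() skips NaN values by default
pd_result = with_nan.quantile(0.5)

print(np_result, np_nan_result, pd_result)
```

If the NumPy and Pandas results disagree on the same column, missing values are a good first thing to check.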
9. Calculating Multiple Percentiles with a Single agg() Function (e.g., 25th, 50th, and 75th percentiles)
agg() is very flexible: besides a dictionary or a list of functions, you can pass a single function that returns several percentiles at once.
percentiles = df['values'].agg(lambda x: x.quantile([0.25, 0.50, 0.75]))
print(percentiles)
10. Using apply() with a Custom Function in Pandas
The apply() function in Pandas offers a flexible way to calculate percentiles by applying a custom function to a DataFrame. This is particularly useful when you have complex logic or need to compute percentiles based on specific conditions. By defining a custom function and applying it to each column of your data, you can tailor the percentile calculation to requirements not addressed by standard methods like quantile() or np.percentile().
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'data': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})
# Define a custom function to calculate percentile
def calculate_percentile(series, percentile):
    return series.quantile(percentile / 100)
# Apply the function once per column (not once per element);
# extra keyword arguments such as percentile are passed through to it
percentile_90 = df[['data']].apply(calculate_percentile, percentile=90)['data']
print(f"90th Percentile: {percentile_90}")