Women's vs Men's Soccer Analysis ⚽

As a lifelong football fan, I’ve spent countless hours watching matches — from thrilling men's World Cup finals to exciting women's international tournaments. The beautiful game has always been my passion, and I can’t help but notice patterns that others might miss.
Over the years, one thing has really caught my eye: women's international matches often seem to have more goals than men's. As someone who loves both watching and analyzing football, I couldn’t resist digging into the data to see if my gut feeling is correct. This could make for an exciting story my readers would love!
To make sure my analysis is fair, I decided to focus on official FIFA World Cup matches (excluding qualifiers) since January 1, 2002. The game has evolved a lot over the years, so I wanted to compare the most recent trends in both men’s and women’s football.
I gathered two datasets with the results of every official international match:
women_results.csv– all women's matchesmen_results.csv– all men's matches
Here’s the question I wanted to answer:
Do women's international soccer matches really have more goals than men's?
I decided to test this using a 10% significance level, comparing the mean number of goals scored in each gender’s matches:
- H₀ (Null Hypothesis): The mean number of goals scored in women's matches is the same as in men's.
- Hₐ (Alternative Hypothesis): The mean number of goals scored in women's matches is greater than in men's.
Analysis
# Start your analysis!
import pandas as pd
from scipy.stats import ttest_ind
import pingouin
df_man = pd.read_csv(’/kaggle/input/portfolio-projects-data/men_results.csv’)
df_woman = pd.read_csv(’/kaggle/input/portfolio-projects-data/women_results.csv')
df_man_s = df_man[(df_man[’tournament’]==‘FIFA World Cup’) & (df_man[‘date’] > ‘2002-01-01’)]
df_woman_s = df_woman[(df_woman[’tournament’]==‘FIFA World Cup’) & (df_woman[‘date’] > ‘2002-01-01’)]
df_man_s[‘goal_score’] = df_man_s[‘home_score’] + df_man_s[‘away_score’]
df_woman_s[‘goal_score’] = df_woman_s[‘home_score’] + df_woman_s[‘away_score’]
df_man_s[‘gender’] = ‘man’
df_woman_s[‘gender’] = ‘woman’
df_wo_man = pd.concat([df_man_s, df_woman_s], axis=0, ignore_index=True)
df_sub = df_wo_man[[‘goal_score’, ‘gender’]]
df_pivot = df_sub.pivot(columns=‘gender’, values=‘goal_score’)
results = pingouin.mwu(x=df_pivot[‘woman’], y=df_pivot[‘man’], alternative=‘greater’)
result_dict = {‘p_val’: results[‘p-val’][0], ‘result’: ‘reject’}
print(result_dict)
{'p_val': 0.005106609825443641, 'result': 'reject'}What does it mean to reject the hypothesis?
In simple terms, rejecting the null hypothesis (H₀) means that the data provides **strong evidence** that the alternative hypothesis (Hₐ) is true. In our case, it suggests that the **average number of goals in women's international matches is indeed higher than in men's** at the 10% significance level.
This doesn’t “prove” it with 100% certainty — statistics never do — but it gives us confidence that the trend we observed as a football fan is backed up by real data.