Predicting Optimal Crops Based on Soil Measurements

As a passionate machine learning enthusiast and someone fascinated by agriculture, I love applying data science to real-world problems. Recently, a farmer reached out to me for help in deciding which crop to plant for the best yield. The challenge? Making the most out of the soil’s natural nutrients.

Measuring soil metrics like nitrogen (N), phosphorus (P), potassium (K), and pH is critical for understanding soil health, but it can be expensive and time-consuming. The farmer provided a dataset, soil_measures.csv, containing these measurements along with the crop that historically performed best under those conditions. My goal was to build a model that could predict the optimal crop based on soil composition and identify the single most important soil feature for this task.

The Dataset

The dataset contains the following columns:

  • N: Nitrogen content ratio in the soil
  • P: Phosphorus content ratio in the soil
  • K: Potassium content ratio in the soil
  • ph: Soil pH value (acidity/alkalinity)
  • crop: The type of crop that thrives best in that soil (target variable)
Each row represents a unique field with its soil measurements and the historically optimal crop.

Data Preparation

import pandas as pd
from sklearn.model_selection import train_test_split 
# Load the dataset
crops = pd.read_csv("soil_measures.csv")
# Separate features and target
X = crops.drop("crop", axis=1)
y = crops["crop"]
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
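
One thing worth checking after a split like this is that every crop shows up in both the training and the test set. The sketch below uses a tiny made-up table in place of soil_measures.csv (the rows and crop names are invented for illustration) and adds stratify=y, which is an assumption on my part and not part of the code above, to keep each crop's share the same on both sides of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for soil_measures.csv: 3 crops, 10 rows each
toy = pd.DataFrame({
    "N": range(30),
    "P": range(30, 60),
    "K": range(60, 90),
    "ph": [6.0 + 0.01 * i for i in range(30)],
    "crop": ["rice"] * 10 + ["maize"] * 10 + ["cotton"] * 10,
})
X = toy.drop("crop", axis=1)
y = toy["crop"]

# stratify=y (an assumption, not in the original code) preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Row counts on each side of the split, and test rows per crop
print(len(X_train), len(X_test))
print(y_test.value_counts().to_dict())
```

With 30 rows and test_size=0.3, this leaves 21 rows for training and 9 for testing, with 3 test rows per crop.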

Analysis: Evaluating Feature Importance

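from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Evaluate each feature individually: fit one model per soil measurement
features = {}
for feature in ['N', 'P', 'K', 'ph']:
    # Note: scikit-learn 1.5+ deprecates multi_class; with the default lbfgs solver
    # a multi-class target is already fit as a multinomial model
    log_reg = LogisticRegression(multi_class='multinomial')
    log_reg.fit(X_train[[feature]], y_train)
    y_pred = log_reg.predict(X_test[[feature]])
    # Weighted F1 averages the per-crop F1-scores by each crop's share of the test set
    feature_f1 = metrics.f1_score(y_test, y_pred, average='weighted')
    features[feature] = feature_f1
# Identify the best predictive feature and its score
best_predictive_feature = {max(features, key=features.get): max(features.values())}
print(features)
# Output (rounded):
{'N': 0.0956, 'P': 0.1241, 'K': 0.2347, 'ph': 0.0679}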
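
To read the result at a glance, the scores can be sorted from most to least predictive. The snippet below is a small sketch that hard-codes the rounded scores printed above instead of retraining the models; ranked, best_feature, and best_score are names introduced here for illustration.

```python
# Rounded weighted F1-scores copied from the output above
features = {'N': 0.0956, 'P': 0.1241, 'K': 0.2347, 'ph': 0.0679}

# Sort features from highest to lowest score
ranked = sorted(features.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")

# The top entry is the best single predictor
best_feature, best_score = ranked[0]
print(f"Best single feature: {best_feature} ({best_score:.4f})")
```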

Interpretation

From the analysis, Potassium (K) has the highest F1-score (0.2347) among all the soil features. But what does this mean?

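The F1-score is the harmonic mean of precision and recall. Precision asks: of the fields where the model predicts a given crop, in what share is that crop actually the best one? Recall asks: of the fields where a given crop is actually the best one, what share does the model identify? Because the dataset contains several crops, the score is computed for each crop and then averaged, weighted by how many test rows each crop has (that is what average='weighted' does). In simple terms, it tells us how well the model predicts the right crop while avoiding mistakes. A score close to 1 means excellent predictive performance, while a score near 0 means the feature alone is not very informative.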
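
To make the metric concrete, here is a small pure-Python sketch that computes a weighted F1-score by hand on six invented predictions. The crop names and the helper functions per_class_f1 and weighted_f1 are made up for illustration; in the analysis itself the score comes from metrics.f1_score(..., average='weighted').

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1-score for one crop, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correct predictions of this crop
    fp = sum(t != label and p == label for t, p in pairs)  # predicted this crop, but wrong
    fn = sum(t == label and p != label for t, p in pairs)  # missed this crop
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-crop F1-scores, weighted by each crop's number of true rows."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        count / total * per_class_f1(y_true, y_pred, label)
        for label, count in support.items()
    )

# Invented example: six fields, three crops
y_true = ["rice", "rice", "rice", "maize", "maize", "cotton"]
y_pred = ["rice", "rice", "maize", "maize", "cotton", "cotton"]

print(f"{weighted_f1(y_true, y_pred):.4f}")  # prints 0.6778
```

Here rice scores 0.80, maize 0.50, and cotton about 0.67. Weighting those by the number of true rows per crop (3, 2, and 1) gives roughly 0.68.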

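Since K has the highest F1-score, it suggests that **potassium content is the most informative single soil feature for predicting which crop will grow best**. Keep in mind that 0.2347 is still a low score in absolute terms, so no single measurement is enough to pick a crop reliably on its own. What the result does tell the farmer is where to start: if the budget only allows one measurement, potassium is the one to prioritize.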

This project highlights how machine learning can help make **data-driven farming decisions** and optimize crop yield using key soil metrics.