Pre-Prediction Process on Placements using Python.

Apr 29, 2021

The life of a student is not just a journey, it's an emotion! We tend to define “success” from our own limited perspective, and that perspective is usually shaped by other people's thoughts and opinions. Often it amounts to nothing more than following what the majority does!

At the graduation level, every student reaches the stage of making career choices, and job recruitment tops the list. After all, everyone aspires to be a job-holder.

Let us try a simple analysis of how this trend of being ‘Placed’ or ‘Not Placed’ works.

We take a class of students and collect information on their 10th and 12th percentages and boards, their graduation details, and their employability test percentage. The output is the status, Placed or Not Placed, along with the salary.

This sample dataset is taken from https://www.kaggle.com/benroshan/factors-affecting-campus-placement

Here is a sample of what our data looks like:

We find the total number of rows and columns using “.shape”.

Here the dataset comprises 215 observations and 15 characteristics.
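As a rough sketch of this first look, the snippet below builds a tiny stand-in table and inspects it. The column names are assumed from the Kaggle dataset and the values are made up for illustration; with the real file you would load the CSV with `pd.read_csv` instead.

```python
import pandas as pd

# Toy stand-in for the placement data. Column names are assumed from the
# Kaggle dataset; the values are invented for illustration only.
# With the real file: df = pd.read_csv("path/to/placement.csv")  # hypothetical path
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "ssc_p": [67.0, 79.3, 65.0, 56.0],
    "hsc_p": [91.0, 78.3, 68.0, 52.0],
    "salary": [270000, 200000, 250000, 300000],
})

print(df.head())        # first rows of the table

rows, cols = df.shape   # .shape is an attribute: (rows, columns)
print(rows, cols)
```

On the full dataset, `df.shape` is where the 215 observations and 15 characteristics come from.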

It is also a good practice to know the columns and their corresponding data types, along with finding whether they contain null values or not.

For this, we use the “.info” function, which reports the data type of each column and its count of non-null values.
Looking at the data types makes it easy to tell the categorical variables from the numerical ones, and therefore which operations we can and cannot perform on each.
The data has float, integer, and object values.
No column has null/missing values.

Here is the info about the dataset we’ve chosen
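A minimal sketch of the same checks, again on a made-up table with assumed column names. Besides `.info()`, it uses `isnull().sum()` to count missing values per column and `select_dtypes` to split the columns into numerical and categorical ones.

```python
import pandas as pd

# Toy table with integer, float and text columns (values are invented).
df = pd.DataFrame({
    "sl_no": [1, 2, 3, 4],
    "gender": ["M", "F", "M", "F"],
    "ssc_p": [67.0, 79.3, 65.0, 56.0],
    "status": ["Placed", "Placed", "Placed", "Not Placed"],
    "salary": [270000.0, 200000.0, 250000.0, 300000.0],
})

df.info()                          # prints dtypes and non-null counts

null_counts = df.isnull().sum()    # missing values per column
print(null_counts)

# Split columns by type: numbers vs. everything else.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()
print(numerical, categorical)
```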

The “.describe” function in pandas is very handy in getting various summary statistics. This function returns the count, mean, standard deviation, minimum and maximum values, and the quantiles of the data.

Here we can notice that, for each column, the mean is almost equal to the median (the 50th percentile).
So, we can conclude that the distributions are approximately symmetric, with skewness close to zero.
There is also a considerable gap between the 75th percentile and the max, which suggests that the data might contain outliers.
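The comparisons described above can be sketched as follows. The numbers are invented, and the salary column deliberately contains one very large value so that the gap between the 75th percentile and the max is visible.

```python
import pandas as pd

# Invented values; one salary is far above the rest on purpose.
scores = pd.DataFrame({
    "ssc_p": [56.0, 65.0, 67.0, 79.0, 83.0],
    "salary": [200000.0, 250000.0, 270000.0, 300000.0, 940000.0],
})

summary = scores.describe()   # count, mean, std, min, quartiles, max
print(summary)

# Mean vs. median (the 50% row): a small gap hints at symmetry.
mean_minus_median = summary.loc["mean"] - summary.loc["50%"]

# Max vs. 75th percentile: a large gap hints at possible outliers.
gap_to_max = summary.loc["max"] - summary.loc["75%"]

# Sample skewness per column, as a direct check on symmetry.
skew = scores.skew()
print(mean_minus_median, gap_to_max, skew, sep="\n")
```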

Outlier identification:

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

In our dataset, how would an outlier change the analysis?
Consider the example of a salary package. If we want to know the trend in salary, we essentially rely on its mean. But there might be a person or two who is paid an exceedingly high or low amount, and this can mislead our understanding. So, it is important to deal with these outliers properly in order to get accurate results.

To do this, we import NumPy and calculate the Z-score.
The Z-score is calculated as (observation_value - mean) / standard deviation.
If the absolute value of the Z-score is greater than a threshold chosen by the user, the observation is treated as an outlier.
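Here is a small sketch of the Z-score check with NumPy. The salaries are invented, and the threshold of 2 is an assumption picked because the toy sample is tiny; 3 is a common choice on larger samples.

```python
import numpy as np

# Invented salary figures; the last one is unusually high.
salaries = np.array(
    [200000, 240000, 250000, 260000, 270000,
     280000, 300000, 300000, 320000, 940000],
    dtype=float,
)

threshold = 2.0  # assumed cut-off for this tiny sample

# Z-score: (observation - mean) / standard deviation
z_scores = (salaries - salaries.mean()) / salaries.std()

# Keep the observations whose absolute Z-score exceeds the threshold.
outliers = salaries[np.abs(z_scores) > threshold]
print(outliers)
```

Note the parentheses around the subtraction: without them, only the mean would be divided by the standard deviation.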

Another interesting feature that we explore is the “.unique” function.

Here we have applied this function to the salary column to get its distinct values, which tells us how many different salary figures there are in total.

These are returned as an array of unique elements.

What if we wanted to know how many people receive each salary figure?
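For instance, on a short made-up salary column (the figures are invented), `.unique` and its companion `.nunique` behave like this:

```python
import pandas as pd

# Invented salary column with repeated figures.
salary = pd.Series([300000, 250000, 300000, 240000, 250000, 300000])

unique_salaries = salary.unique()   # distinct values, in order of first appearance
n_unique = salary.nunique()         # how many distinct values there are

print(unique_salaries)
print(n_unique)
```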

For that, we've got to use the “.value_counts” function.

This tells us the count of each salary figure, in descending order of count.
In our output, the most common salary is 300000.

Here is the value count of the salary
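A quick sketch of the same call on invented figures, chosen so that 300000 is the most frequent value as in our data:

```python
import pandas as pd

# Invented salary column; 300000 appears most often.
salary = pd.Series([300000, 250000, 300000, 240000, 250000, 300000])

counts = salary.value_counts()   # sorted by count, largest first
print(counts)

most_common_salary = counts.index[0]   # the first index label is the top value
print(most_common_salary)
```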

Let’s now explore the data with graphs. Python has a visualization library, Seaborn, which is built on top of matplotlib. It provides attractive statistical graphs for both univariate and multivariate analysis.

One can find correlations using pandas “.corr()” function and can visualize the correlation matrix using a heatmap in seaborn.

In our heatmap, dark shades represent a positive correlation while lighter shades represent a negative correlation; this mapping depends on the colormap in use.
If you set annot=True, the correlation value for each pair of features is printed in its grid cell.

Here is our correlation matrix
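The plot can be sketched roughly like this, on invented percentages with assumed column names. The `coolwarm` colormap and the fixed range of -1 to 1 are my own choices for the sketch, not necessarily the settings behind the figure above. On the full dataset, which also has text columns, you would likely need `df.corr(numeric_only=True)`.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented percentages; column names are assumed from the dataset.
df = pd.DataFrame({
    "ssc_p": [56.0, 65.0, 67.0, 79.0, 83.0],
    "hsc_p": [52.0, 68.0, 91.0, 78.0, 73.6],
    "degree_p": [52.0, 64.0, 58.0, 77.5, 72.0],
})

corr = df.corr()   # pairwise Pearson correlation between columns
print(corr)

# annot=True writes each correlation value inside its grid cell.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
plt.tight_layout()
# In a plain script, call plt.show() to display the figure.
```

With `coolwarm`, warm colors mark positive correlations and cool colors mark negative ones, so it is worth checking the color bar before reading shades.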

Until now we've been exploring the data one variable at a time.

Now, let's experiment with analyzing the data across two or more parameters.

Gender is an obvious factor to consider when dealing with a class of students, so let's begin with that.

Let's see how many males and females were placed and earn a salary above 3 lakh (300000).

Here is the summary of the males who earn a salary above 3 lakh, with their salary details.

Number of males with a salary above 3 lakh

Here is the summary of the females who earn a salary above 3 lakh, with their salary details.

Number of females with a salary above 3 lakh
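One way to produce such summaries is boolean filtering, sketched below on invented rows. The column names and the "M"/"F" and "Placed" codes are assumptions about how the dataset labels these fields.

```python
import pandas as pd

# Invented rows; labels such as "M", "F" and "Placed" are assumed.
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M"],
    "status": ["Placed", "Placed", "Placed", "Placed", "Placed"],
    "salary": [270000, 350000, 400000, 300000, 320000],
})

threshold = 300000  # 3 lakh

# Combine conditions with & and wrap each one in parentheses.
placed_high = df[(df["status"] == "Placed") & (df["salary"] > threshold)]

males_high = placed_high[placed_high["gender"] == "M"]
females_high = placed_high[placed_high["gender"] == "F"]

print(males_high[["gender", "salary"]])
print(females_high[["gender", "salary"]])
print(len(males_high), len(females_high))
```

The comparison is strict, so a salary of exactly 300000 is not counted as above 3 lakh.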

Mean is perhaps the most important statistic in data because it forms the basis of conducting and understanding all other complex statistics. The mean is the “center of gravity” of your data and is meant to carry a piece of information from every member of the sample.

The mean of any numerical column can be computed with the “.mean” function.

So, let us try to understand the mean of the salary.

Mean of the salary
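As a small illustration on invented figures, the snippet below computes the mean next to the median. The single very high salary pulls the mean well above the median, which is the outlier effect discussed earlier.

```python
import pandas as pd

# Invented salaries with one very high value.
salary = pd.Series([200000, 250000, 270000, 300000, 940000])

mean_salary = salary.mean()       # arithmetic average
median_salary = salary.median()   # middle value, less sensitive to extremes

print(mean_salary, median_salary)
```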

Now, the most important aspect of our analysis comes into the picture, and the wait is over! You must be waiting to see which parameters decide whether a student is placed.