Applying Principal Component Analysis

Integrate PCA into your production application

If this is your first time here, you may want to go through my previous deep dives into principal component analysis first. Take a look at my tutorial I and tutorial II.

To recap, Principal Component Analysis is a way to reduce the number of dimensions in our data set. This can make our computations faster and, when many features are correlated, may help us make better predictions as well.

Now that you have a fair idea of how PCA works, you may want to use it in your production models. Let’s see how we can do that.

Let us first call all the dependencies that we will be using.

We will use the same data set that we have used in the previous tutorial.

Note that there are 51 observations of hitech, 55 observations of bhagyanagar and 53 observations of hudco. These counts are important, as we will use them shortly to build the labels.

We now need to combine the three data sets so that we can run our calculations on all three of them together. We will take only a few of the columns that we think are relevant for our analysis.
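As a rough sketch of this step, the three frames can be stacked with pandas. The data below is random filler and the column names (`open`, `high`, `low`, `close`, `volume`) are hypothetical placeholders, not the columns of the actual data set; only the row counts (51, 55, 53) come from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical column names; the real data set's columns may differ.
all_columns = ["open", "high", "low", "close", "volume"]

def placeholder_frame(n_rows):
    """Random stand-in for one company's data (not real observations)."""
    return pd.DataFrame(rng.normal(size=(n_rows, len(all_columns))),
                        columns=all_columns)

hitech = placeholder_frame(51)
bhagyanagar = placeholder_frame(55)
hudco = placeholder_frame(53)

# Keep only the columns assumed to be relevant (an illustrative choice).
selected = ["open", "high", "low", "close"]

# Stack the three frames row-wise, in a fixed order, with a fresh index.
X = pd.concat(
    [hitech[selected], bhagyanagar[selected], hudco[selected]],
    ignore_index=True,
)
print(X.shape)
```

The stacking order matters: the labels defined next have to follow the same order as the rows.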

We will now define the output for our data set. Let’s say that this is a classification problem where we are trying to say which of the three companies an incoming test observation belongs to. To denote the three companies we come up with arbitrary numeric labels (1000, 2000, 3000). These numbers are spaced in 1000’s so that there is no ambiguity or overlap between the three.

Now, while defining the output, or what we will call our y, we will use the shapes that we found earlier. For example, since there are 51 observations of hitech, y will start with 51 entries of 1000. The following code should capture the idea.
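A minimal sketch of that idea with NumPy, assuming the rows were stacked in the order hitech, bhagyanagar, hudco:

```python
import numpy as np

# Observation counts quoted in the text.
n_hitech, n_bhagyanagar, n_hudco = 51, 55, 53

# One label per row, in the same order in which the data sets were stacked.
y = np.concatenate([
    np.full(n_hitech, 1000),
    np.full(n_bhagyanagar, 2000),
    np.full(n_hudco, 3000),
])

print(y.shape)       # (159,)
print(np.unique(y))  # [1000 2000 3000]
```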

Now that the preprocessing of the data is done, let’s come to the main story line. Let’s say that in production you have a model based on RandomForestClassifier running.
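Such a baseline might look roughly like the sketch below. It runs on synthetic data from scikit-learn's `make_blobs` (159 rows split 51/55/53, six made-up features), and the train/test split and hyperparameters are assumptions chosen for illustration, not the settings of any real production model.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined data: 51 + 55 + 53 rows, 6 features.
X, cluster = make_blobs(n_samples=[51, 55, 53], n_features=6, random_state=0)
y = (cluster + 1) * 1000  # map clusters 0, 1, 2 to labels 1000, 2000, 3000

# Hold out part of the data so the score is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

baseline = RandomForestClassifier(n_estimators=100, random_state=0)
baseline.fit(X_train, y_train)

baseline_score = baseline.score(X_test, y_test)  # mean accuracy
print(f"baseline accuracy: {baseline_score:.2f}")
```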

This model gives accurate results. Although the matrices in the example taken here are pretty small, let’s say that in our ‘real’ world scenario the matrices are huge and take up a lot of computational power. We also know that a lot of the features in our data set are correlated with each other.

So let’s perform Principal Component Analysis and reduce our data set to two dimensions. Before we can do that, we need to scale the data set, because PCA is sensitive to the scale of each feature.

Note that the resulting mean is almost 0. This is just a check that the scaling worked properly.
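One common way to do this scaling is scikit-learn's `StandardScaler`; whether the original code used it is an assumption. The sketch below again uses synthetic `make_blobs` data in place of the real features and checks the mean afterwards.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (159 rows, 6 made-up features).
X, _ = make_blobs(n_samples=[51, 55, 53], n_features=6, random_state=0)

# Standardise each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Sanity check: column means should be ~0 and standard deviations ~1.
print(np.abs(X_scaled.mean(axis=0)).max())
print(X_scaled.std(axis=0))
```

The mean is only "almost" 0 because of floating-point rounding.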

We will now run the classifier on the scaled, PCA-reduced data.
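One way to wire scaling, PCA and the classifier together is a scikit-learn pipeline, sketched below on synthetic data. The pipeline is an illustrative choice rather than necessarily how the original code was organised; its advantage in production is that the scaler and the PCA are fitted on the training rows only and then reused unchanged on incoming data.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the combined data and its labels.
X, cluster = make_blobs(n_samples=[51, 55, 53], n_features=6, random_state=0)
y = (cluster + 1) * 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Scale, project onto the first two principal components, then classify.
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pca_model.fit(X_train, y_train)

# Apply only the preprocessing steps to see what the classifier receives.
reduced = pca_model[:-1].transform(X_test)
print(reduced.shape)  # two columns, one per principal component

# Share of the variance kept by each of the two components.
print(pca_model.named_steps["pca"].explained_variance_ratio_)
```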

Let’s check how it measures up to the previous model. We can do that by comparing the scores of the two models.

Interestingly, the score of the normal predictions is 1.0, which is probably because our data set is small. The PCA predictive score is 96%, which is pretty decent, although we can probably make it better. Verifying manually on a small sample gives us identical results.
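A comparison of this kind can be sketched as follows. It runs on synthetic data, so the scores it prints will not match the 1.0 and 96% quoted above; the split, the hyperparameters and the sample size of five are all assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the combined data and its labels.
X, cluster = make_blobs(n_samples=[51, 55, 53], n_features=6, random_state=0)
y = (cluster + 1) * 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Model 1: random forest on the raw features.
baseline = RandomForestClassifier(n_estimators=100, random_state=0)
baseline.fit(X_train, y_train)

# Model 2: the same classifier after scaling and a 2-component PCA.
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pca_model.fit(X_train, y_train)

# Score both models on the same held-out rows.
baseline_score = baseline.score(X_test, y_test)
pca_score = pca_model.score(X_test, y_test)
print(f"baseline: {baseline_score:.2f}  with PCA: {pca_score:.2f}")

# Manual check on a small sample: predictions next to the true labels.
sample, truth = X_test[:5], y_test[:5]
print("baseline :", baseline.predict(sample))
print("with PCA :", pca_model.predict(sample))
print("truth    :", truth)
```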
