Fundamental Python Data Science Libraries – Scikit-Learn
If you are a developer who wants to integrate data manipulation or data science into your product, or you are just starting your journey in data science, here are the Python libraries you need to know.
The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.
Scikit-Learn (website here) is built on top of NumPy, SciPy, and matplotlib. It contains an extensive collection of ready-to-use Machine Learning algorithms. All the algorithms are well documented and easy to use for all experience levels.
Focus of the Library
This library contains many powerful algorithms, each of which is its own object with its own parameters and methods.
Open a command line and type in: pip install scikit-learn
Windows: in the past I have found installing NumPy & other scientific packages to be a headache, so I encourage all you Windows users to download Anaconda’s distribution of Python which already comes with all the mathematical and scientific libraries installed.
As you can see on its homepage, Scikit-Learn is split into six subjects: Classification, Regression, Clustering, Dimensionality Reduction, Model Selection, and Preprocessing.
The first three subjects (Classification, Regression, and Clustering) are all types of algorithms, while the last three (Dimensionality Reduction, Model Selection, and Preprocessing) alter and analyze your data so it works better when fed to an algorithm. This article focuses on the first three subjects.
Both Classification and Regression algorithms fall under a section of Machine Learning called “Supervised Learning”. What this means is these two types of algorithms have something in common: the data fed into the algorithms has both observational data and targets (the results, or ‘answers’).
A standard example of this is data on homes and their sale prices. The observational data are the features of each home. The target is the price the home sold for on the market.
We then can use Supervised Learning to predict the price of any home as long as we have features about it.
- Classification means the target values are discrete (such as labeling a house as either expensive or inexpensive).
- Regression means the results are continuous (the actual price of the home in dollars which could be any positive number).
The Clustering section falls under “Unsupervised Learning”. This means that we have observational data but no targets. Instead, we want to use an algorithm to find groups inside the observational data and create labels. A common example is user segmentation in sales or traffic data.
First, you need to examine if your data fits a Classification, Regression, or Clustering scenario. Then go to the corresponding section of Scikit-Learn and select an algorithm. As mentioned before, each algorithm is its own object.
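To make the three scenarios concrete, here is a minimal sketch on hypothetical toy data (six made-up homes, not a real dataset). The particular estimators chosen (LinearRegression, LogisticRegression, KMeans) are just one possible pick from each section of the library.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature per home (size in thousands of square feet).
X = np.array([[0.8], [0.95], [1.1], [2.0], [2.3], [2.6]])

# Regression: the targets are continuous (sale price in dollars).
prices = 200_000 * X.ravel()
reg = LinearRegression().fit(X, prices)

# Classification: the targets are discrete (0 = inexpensive, 1 = expensive).
labels = (prices > 300_000).astype(int)
clf = LogisticRegression().fit(X, labels)

# Clustering: no targets at all; the algorithm creates the group labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(reg.predict([[1.5]]))  # a dollar amount
print(clf.predict([[1.5]]))  # a class label, 0 or 1
print(km.labels_)            # one cluster id per home
```

Notice that the two supervised models receive both X and a target, while the clustering model receives only X.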
We will start with a very simple example in order to get a feel for the library: Linear Regression.
We are going to use one of the Scikit-Learn’s built-in datasets as a walk-through in creating, fitting, and predicting using this model.
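The text here does not name the built-in dataset used, so the sketch below assumes load_diabetes, a small regression dataset that ships with Scikit-Learn. Any other built-in regression dataset would work the same way.

```python
from sklearn.datasets import load_diabetes

# Assumption: load_diabetes stands in for the article's dataset.
data = load_diabetes()

print(data.data.shape)     # observational data: rows are samples, columns are features
print(data.feature_names)  # names of the feature columns
print(data.target[:5])     # the first few targets (continuous values)
```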
Note: this is not intended to be an in-depth analysis of Linear Regression, just a quick example.
Normally, you will initialize the algorithm object with the parameters you want. There are defaults for many, just like we see here. However, I recommend researching what each of the parameters means to make sure you have made the right choice.
We are going to put the data into a pandas DataFrame to make separating the data into training and testing sets straightforward.
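As a small sketch of what initializing the algorithm object looks like, the snippet below creates a LinearRegression with its defaults and prints them, then creates a second one with a parameter set explicitly.

```python
from sklearn.linear_model import LinearRegression

# Create the algorithm object with all default parameters.
model = LinearRegression()

# Inspect the parameters (and their defaults) before relying on them.
print(model.get_params())

# The same object with one parameter spelled out explicitly.
explicit_model = LinearRegression(fit_intercept=True)
```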
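One way this step might look, again assuming the load_diabetes dataset as a stand-in. The test_size and random_state values are illustrative choices, not values taken from the article.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

data = load_diabetes()  # assumption: stand-in dataset

# Put the observational data and the target into one DataFrame.
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Separate features from the target, then split into train and test sets.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # illustrative values
)

print(X_train.shape, X_test.shape)
```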
Now we are ready to fit the model.
I’m going to take the training data and put it into the Linear Regression algorithm by using the .fit method. This method will calculate the underlying linear equation that best fits the data.
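A minimal, self-contained sketch of the fitting step, under the same assumptions as before (load_diabetes as the dataset, an illustrative 75/25 split):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: stand-in dataset and illustrative split settings.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# .fit calculates the linear equation that best fits the training data.
model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_)       # one coefficient per feature column
print(model.intercept_)  # the constant term of the equation
```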
That’s it! Nice, so how did we do?
In order to evaluate how well our algorithm is able to make a prediction based only on observational data, we use the .predict method. We will use the test set of observational data to make predictions.
We will skip over checking the plot of the residuals and just look at these metrics. The metrics tell us this model is OK. We were able to explain about 70% of the variance in the target with our model. If we were to run this model with different combinations of columns in our observational data, the mean squared error metric would help us compare between models.
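Here is a sketch of the predict-and-evaluate step using mean_squared_error and r2_score from sklearn.metrics. Because it assumes the load_diabetes dataset and an illustrative split, the numbers it prints will not match the roughly 70% figure quoted above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumption: stand-in dataset and illustrative split settings.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Predict using only the observational data from the test set.
predictions = model.predict(X_test)

# Compare the predictions against the targets the model never saw.
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)  # share of target variance explained

print("Mean squared error:", mse)
print("R squared:", r2)
```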
Cool! So you have seen that this library has algorithm objects and each will have a fit method. If the algorithm you are using is a Regression or Classification algorithm it will also have a predict method.
Each algorithm may differ so be sure to read the documentation.
Let’s apply the walkthrough we just did to our real-life Bitcoin scenario. In my article on pandas, we acquired data on Bitcoin and created a signal for when Bitcoin’s price had dipped below its rolling 30-day average. In my last article, we used matplotlib to plot the data.
Say we are cautious investors, and watching the 30-day rolling average alone is not a good enough analysis. Is there a better way to examine market behavior?
I once came across a financial product that utilized Clustering to help traders visualize groups of similar market behavior. I thought this was intriguing because building accurate models using Supervised Learning often requires data out of the average person’s reach. Clustering allows anyone to find patterns in what they have available.
So let’s see if we can leverage Clustering to discover patterns in Bitcoin prices. I procured a small Bitcoin dataset from Quandl (you’ll need an account). This dataset includes about 3 months’ worth of Bitcoin prices. I picked this dataset because it has market open and close prices.
Let’s see if there are groups of market highs and lows based on the Bitcoin price when the market opens. If yes, perhaps we have a market strategy on our hands!
Here are the libraries we need:
Here’s my code to set up the data:
Here is a function for making the visualizations we need to examine the data:
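The original function is not reproduced here, so what follows is only a sketch of what such a helper might look like. The name plot_scatter and its parameters are hypothetical. It draws one scatter series per cluster label and paints the label -1 (the value DBSCAN uses for outliers) in black.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_scatter(x, y, labels=None, xlabel="Open", ylabel="High"):
    """Hypothetical helper: scatter y against x, colored by cluster label."""
    x = np.asarray(x)
    y = np.asarray(y)
    fig, ax = plt.subplots()
    if labels is None:
        # No clustering yet: plot every point in a single color.
        ax.scatter(x, y)
    else:
        labels = np.asarray(labels)
        for label in np.unique(labels):
            mask = labels == label
            if label == -1:
                # -1 marks points that belong to no cluster.
                ax.scatter(x[mask], y[mask], color="black", label="outlier")
            else:
                ax.scatter(x[mask], y[mask], label=f"cluster {label}")
        ax.legend()
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return ax


# Usage with a few made-up points (not real prices).
ax = plot_scatter([1, 2, 3], [2, 3, 4], labels=[0, 0, -1])
```

Call plt.show() afterwards to display the figure when running outside a notebook.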
Here’s what our data looks like:
Now that’s a linear relationship if I’ve ever seen one. But my sense is that quite a few more data points would be needed to build a super accurate Regression model with this data, so let’s stick to what a Clustering algorithm can tell us.
Scikit-Learn has many Clustering algorithms available. We will use DBSCAN because we don’t know how many clusters there should be and we want to focus on areas with a concentration of data points. There are other clustering algorithms that you could use based on how you want to structure your strategy.
Here’s the code for building a Clustering algorithm for both Open vs High and Open vs Low data.
Epsilon: the maximum distance between two data points for them to still be considered part of the same cluster. I picked $150 for this value.
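A sketch of how the clustering step could be written. The Quandl data is not included here, so the prices below are randomly generated stand-ins, not real Bitcoin prices. The eps=150 value comes from the text above; min_samples=5 is an assumption (it is also the library default), and the helper name cluster_prices is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Synthetic stand-in for about 90 days of prices (NOT real market data).
rng = np.random.default_rng(0)
open_prices = np.concatenate(
    [rng.normal(6500, 60, 45), rng.normal(7400, 60, 45)]
)
df = pd.DataFrame(
    {
        "Open": open_prices,
        "High": open_prices + rng.uniform(20, 250, 90),
        "Low": open_prices - rng.uniform(20, 250, 90),
    }
)


def cluster_prices(frame, columns, eps=150, min_samples=5):
    """Hypothetical helper: return one DBSCAN label per row (-1 = outlier)."""
    model = DBSCAN(eps=eps, min_samples=min_samples)
    return model.fit_predict(frame[columns].to_numpy())


# One clustering for Open vs High and one for Open vs Low.
df["HighCluster"] = cluster_prices(df, ["Open", "High"])
df["LowCluster"] = cluster_prices(df, ["Open", "Low"])

print(df["HighCluster"].value_counts())
print(df["LowCluster"].value_counts())
```

Because eps is measured in the same units as the data, a value of 150 here means 150 dollars of straight-line distance in the Open/High (or Open/Low) plane.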
So what did we find?
Our clusters are groups of similar market behavior. The black dots are outliers that do not belong to a cluster.
Let’s look at each cluster and find the average difference between the High/Low price and Open price.
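One way to compute these averages is a pandas groupby over the cluster labels. The tiny DataFrame below is made up purely for illustration, and the column names (Cluster, HighDiff, LowDiff) are assumptions rather than the article's own.

```python
import pandas as pd

# Made-up prices with cluster labels already assigned (-1 = outlier).
df = pd.DataFrame(
    {
        "Open": [6400, 6500, 6600, 7300, 7400, 9000],
        "High": [6500, 6650, 6750, 7400, 7550, 9100],
        "Low": [6300, 6400, 6480, 7200, 7300, 8800],
        "Cluster": [0, 0, 0, 1, 1, -1],
    }
)

# Drop the outliers, then measure how far High and Low moved from Open.
clustered = df[df["Cluster"] != -1].copy()
clustered["HighDiff"] = clustered["High"] - clustered["Open"]
clustered["LowDiff"] = clustered["Low"] - clustered["Open"]

# Average Open price and average movement per cluster.
summary = clustered.groupby("Cluster")[["Open", "HighDiff", "LowDiff"]].mean()
print(summary)
```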
The results above tell us that if the market opens one day with a Bitcoin price around $6,500, similar data points saw an average High price of +$139 and an average Low price of -$113 relative to Open during the day.
So now what? The next step is to put what we learned into a system that tests and executes trading strategies automatically. Next steps are on you! Good luck!
Thanks for reading! If you have questions feel free to comment & I will try to get back to you.