
Month: April 2019

The Ultimate Data Interview Checklist

Nervous for your Data Science / Data Engineering interview? Start here.

Data Science, Data Engineering, Business Intelligence, Data Analysis, and other related positions fall at an intersection of coding, databases, statistics, and business/product. This blend of subjects results in an engaging and challenging career. The interviews are similarly engaging and challenging. 🙂

When I was studying for my data interviews, I noticed there was no one “holy grail”. I was endlessly googling “review SQL fast”, “data science interview questions”, “statistics questions”, “data model interview questions”, and more. I found a lot of resources and also a lot of gaps – which is why I wrote this list.

I’m putting together this list to help others who are taking their next step in their data career. While the title says “ultimate”, I’m hoping to keep collecting resources from others who have also been through the process. I want to get feedback on what worked and what didn’t. I want to keep growing this list.

How can you help? Share this with your friends! And also reach out to me with resources that worked for you!

Data Structures & Algorithms

https://www.interviewcake.com

Interview Cake: My subscription to this service was the best money I ever spent. Thank goodness I did because it prepared me for every data structure and algorithm question that came my way. They take you through the theory and how to code it like no other resource I have seen. Be prepared though, it’s pricey but worth it.

Cracking the Coding Interview: This book is the best place to start learning and reviewing the software engineering aspects of your data interviews. The reality is that software engineering fundamentals are generally expected of those of us in this field, and this book really helps cover any gaps you may have.

Introduction to Algorithms: Be prepared to invest time in this one, but it is worth it. This textbook covers algorithms in depth. I re-coded the pseudo-code in Python and it was the perfect supplement to Interview Cake.

SQL

W3 Schools – SQL: This is the place to go if you have never written any SQL before. It lets you run and experiment with queries while learning SQL syntax. The most important thing here is to learn to visualize the data underneath each query.

Data Mastery – SQL: I wrote this one because I noticed a lot of the SQL resources are badly formatted or missing tricks of the trade. I designed it so technologists could review SQL syntax quickly, practice, and continue to use this book as a reference.

https://selectstarsql.com

Select Star SQL: This one was recommended to me by a software engineer. It has interactive practice questions that go all the way up to JOINs, which is great. This resource will also be very helpful for experimenting with queries and learning to visualize data.

Mode SQL Tutorial: I came across this somewhere on Quora as a great place to practice beginner to advanced SQL queries. This resource will be great for anyone who needs to ramp up to advanced topics and learn to visualize data transformations.

Python

Data Mastery – Python: I also wrote this one! This book is for anyone who wants to ramp up quickly on Python. It’s designed for technologists to keep this book close and use it as a reference.

Machine Learning Algorithms

Hands-On Machine Learning with Scikit-Learn and TensorFlow: This book is the #1 bestseller in multiple Machine Learning categories on Amazon & came to me through a recommendation. I haven’t read it yet but skimming through the table of contents it looks very robust.

Machine Learning A-Z: Hands on Python & R in Data Science: This is a course on Udemy, which I like because it has exercises and is more interactive than books. It’s great for starting out and doesn’t require you to download an obscure language.

Data Warehousing

Agile Data Warehouse Design: This book covers the fundamentals of data model design. It’s a good book to read if you are headed into a BI or data engineering interview.

Product & Analytics

Cracking the PM Interview: There’s a lot of soft stuff in here. For those of us who are not aiming to be a Product Manager, a lot of it may not be useful. But check out the Company Research, Estimation Questions, Product Questions, and Case Questions chapters.

Lean Analytics: I put a post out asking for product analytics recommendations and this book came highly recommended from a few people. Disclosure: I have not gotten through all 440 pages. From what I have read I like that it dives into different kinds of metrics, how to think, and what to avoid. So far it seems like an important read to prepare for conversations that involve Product Thinking. And the book is recommended for Data Scientists by the authors themselves!

A/B Testing: One huge concern of mine is that a lot of product analytics books gloss over the tough stuff: math. And to get to the math needed for data work, you have to dive into a statistics or machine learning book, which can be very broad. I was glad I read about this Udacity course that focuses specifically on A/B tests, an essential part of any data-related position. It also helps that the course is literally called A/B Testing by Google!

Statistics

Heard in Data Science Interviews: This one was recommended to me by an expert data scientist. This book covers a lot of topics but I was told the statistics section is worthwhile.

Practical Statistics for Data Scientists: This is a solid refresher. If your interview is predominantly statistics, you should invest in reading this book.


This list is incomplete! I’m looking for recommendations for more Machine Learning, Product Analytics, Big Data, and Coding (R & Python) resources that help prepare people for interviews.

Am I missing your favorite resource from this list? Did something else come up in your interviews that is not a subject covered here? Let me know.

Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon


Data Mastery – SQL

The SQL book experienced technologists have been waiting for!

There are a plethora of SQL resources out there. That’s great for anyone who has a lot of time on their hands. But for those of us who are deep into our technical careers, we need brevity & efficiency in all of our resources.

I noticed a trend among my fellow technologists – we are each especially good at one or two subjects, but we come into contact with many more. Many of you spend 99% of your time with JS, Python, Ruby, R, Go, or others. At some point you will come across a project or interview that requires SQL done right.

I wrote Data Mastery – SQL for you. This book is designed to teach the essentials, challenge you with real-world scenarios, and provide solutions, all in under 100 pages (depending on the e-reader). It is perfect for data interview preparation, for advanced software engineering projects that integrate data resources, and for product managers and business professionals focused on being data-driven.

Available now on Amazon <- buy it here! Hit me up with questions!

Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon


Trying to be data driven? Double check your SQL guides!

Originally posted on LinkedIn

Learning SQL is hard, even if you are an experienced programmer. You have to think in terms of relationships: what matches up with what, and how do I aggregate correctly? It is even harder if you come across a resource that looks trustworthy but is actually full of mistakes.

I work as a Data Engineer at a FAANG company and publish on Medium a lot. My recent work is a SQL tutorial designed for people who already have technical skill but need to ramp up fast.

This week I got an email asking for help understanding two SQL examples from the Advanced SQL section of a seemingly normal SQL tutorial. The examples were around finding the Percent of Total and Cumulative Percent of Total.

I wrote this article to broadcast the issues I found and the correct answers to a greater audience. I hope this helps others who are on their journey to learning Data Science and Data Engineering.

The Resources Provided by the Tutorial in Question

The SQL tutorial provided a table:

The SQL tutorial provided a query on how to find the Percent of Total:

The SQL tutorial provided a query on how to find the Cumulative Percent of Total:

The Question I Received

“How does this work? I don’t understand WHERE a1.Sales <= a2.sales and I only see one table with two columns, so how do they get a2.Sales and a2.Name?”

I was a bit confused at first too because there are lots of problems with these exercises!

The First Issue: Use a Join

If you were also confused about why there are two aliases but only one table, take a look at the code following the keyword FROM. You can see the table is referenced twice, separated by a comma.

When you list two tables after FROM separated by a comma, the database matches every row of one against every row of the other and then filters with the WHERE clause, which effectively behaves like an INNER JOIN. In my opinion, this is bad practice. It is easier for everyone to understand if you use the explicit syntax: INNER JOIN.

The Second Issue: Unused Code

In the Percent of Total query, there is a lot of unused code.

  • Nothing in the SELECT statement uses the alias ‘a2’
  • There is no filtering in the WHERE clause that explains what a2 is for

You can achieve the same result with a simple SELECT/FROM query. See below for how to find the Percent of Total for each row:

Thanks to TutorialsPoint for the awesome online editor
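The screenshots of my corrected queries aren’t reproduced here, so below is a minimal sketch of that simpler query, run through Python’s built-in sqlite3 so it is easy to try. The table name, customer names, and sales figures are made up (Stella and Jeff are deliberately given equal sales, which matters for the third issue below):

```python
import sqlite3

# A small table shaped like the tutorial's (one name column, one sales column).
# The names and numbers here are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE total_sales (name TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO total_sales VALUES (?, ?)",
    [("Greg", 50), ("Stella", 100), ("Jeff", 100), ("Maria", 250)],
)

# Percent of Total: no self-join needed, a scalar subquery supplies the grand total.
percent_of_total = """
SELECT name,
       sales,
       ROUND(sales * 100.0 / (SELECT SUM(sales) FROM total_sales), 2) AS pct_of_total
FROM total_sales
ORDER BY sales;
"""
for row in conn.execute(percent_of_total):
    print(row)
```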

The Third Issue: Cumulative Percent of Total Doesn’t Work

The Cumulative Percent of Total query is where that confusing WHERE clause actually gets used.

What they are trying to do is, for each sales number, match up all the sales numbers less than or equal to it (including the record itself), and then aggregate.

The logic they were going for was this:

But the query they wrote allows both Stella and Jeff to match up with each other because they have the same sales amount. So they actually ended up with this:

So when you run the query exactly as written, it doesn’t print the same numbers as the Results section says it should.

The Fourth Issue: Redundant Code

If you didn’t catch the flawed logic right away and are wondering how that got past you, don’t worry: it was hard to read in the first place because of the redundant code.

The section after OR is actually redundant:

  • It is redundant because a1.Sales <= a2.Sales already covers the a1.Sales = a2.Sales case
  • Additionally, if a1.Name = a2.Name then the sales amounts are identical anyway, so the a1.Sales <= a2.Sales condition covers this too

Take a look, it returns the same results when I comment out the redundant code:

Cumulative Percent of Total: The Answer

So how can we actually find the Cumulative Percent of Total? What we need is to label each record with a unique number that also reflects numerical order.

This way we can JOIN the table back to itself and eliminate the possibility of a ‘tie’ which was our third issue (above). We basically need a row number. There is a window function called ROW_NUMBER() which is supported in many versions of SQL.

But unfortunately the editor I’m using runs a release of SQLite that doesn’t include window functions yet. So I just wrote the row numbers in by hand, hope that is ok 🙂

To find the Cumulative Percent of Total, we need to JOIN the table to itself and find for each record all numbers that come before that record. Here is the JOIN I wrote to accomplish this. In case you are a visual learner, I printed out what the data looks like before I aggregate it. This is so you can see that each row accumulates the info of the ones before it.

Then I apply the SUM() aggregation function and use GROUP BY to bring us back to being at the name level.
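A sketch of that approach, continuing the sqlite3 setup from the earlier sketch. Recent SQLite releases do ship window functions, so ROW_NUMBER() is used directly here; if your editor doesn’t support it, you can write the row numbers into a helper table by hand, as described above:

```python
cumulative_percent_of_total = """
WITH numbered AS (
    SELECT name,
           sales,
           ROW_NUMBER() OVER (ORDER BY sales, name) AS rn
    FROM total_sales
)
SELECT a.name,
       a.sales,
       ROUND(SUM(b.sales) * 100.0 / (SELECT SUM(sales) FROM total_sales), 2)
           AS cumulative_pct_of_total
FROM numbered a
INNER JOIN numbered b
        ON b.rn <= a.rn   -- each row joins to itself and every row before it
GROUP BY a.name, a.sales, a.rn
ORDER BY a.rn;
"""
for row in conn.execute(cumulative_percent_of_total):
    print(row)
```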

Unfortunately, not everything on the internet is true, not even some online tutorials! If something stumps you, don’t automatically assume that you are missing something. The tutorial may be wrong! Reach out to someone you trust (if I have time, I’m happy to help)! If all else fails, ask on StackOverflow. 🙂

Share this article with a friend who is studying for data projects or positions! If you have questions feel free to leave them in the comments & I will try to get back to you.

Thanks for reading!

Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon


My Crypto Journey

Originally Published on Hackernoon

My career is in Data Engineering, I hold a Bachelor of Science in Economics, and until recently I was a major skeptic of cryptocurrency. I wasn’t an early adopter because I felt there were economic weaknesses that made the investment too risky.

But after the crash, I came across an interesting perspective that changed my skepticism to interest. I wanted to share my change of mind so enthusiasts and skeptics could better understand their counterpart’s point of view.

First Impressions

I first came across Bitcoin circa 2013 when I was scrolling through news articles discussing its relationship to the Silk Road (website). I again came across it in 2015 when news outlets were claiming that Chinese elites had used Bitcoin to smuggle money out of the country before the Chinese Central Bank infamously devalued the yuan against the US dollar in one day.

As I read, two issues jumped out at me that kept me skeptical for a long time.

Limited Money Supply

A fair complaint about fiat currency is the government’s ability to simply print more cash. This is a major cause of inflation and ultimately hurts the little guy’s purchasing power. Bitcoin wanted to head in the other direction: a pre-determined money supply that is released in decreasing amounts.

This is not a new system, and it does work. At one point our currency was gold-backed. There’s only so much gold on planet Earth. In order to get to it, you need to mine for it. The more gold that is mined, the harder it is to find more.

However, developing software is quite different from mining precious metals. While inside the Bitcoin economy the supply of money is finite, the supply of cryptocurrencies in general is theoretically infinite. Someone can introduce a new supply of money by:

  1. Introducing a new currency (Ethereum, Neo, Ripple, Stellar)
  2. Hard forks that split a currency:
  • June 2016: the Ethereum DAO hack and the subsequent hard fork resulted in two currencies, Ethereum & Ethereum Classic
  • August 2017: the hard fork that increased the size of a Bitcoin block to 8 MB resulted in Bitcoin Cash and Bitcoin
  • October 2017: the hard fork to prevent ASIC mining operations resulted in a new currency, Bitcoin Gold

I do not think the cryptocurrencies’ economies are insulated from each other — inflation could happen.

Monetary Policy

Another thing that concerned me is the very laissez-faire approach to economics, in other words the lack of monetary policy. I think some monetary policy is necessary to counteract a perfect storm of economic events. A perfect example is the recession that followed the dot-com bubble. This recession came after a series of events:

  • Wild speculation on tech stocks by individual and institutional investors
  • Tech companies’ out of control burn rate (how fast they spent money)
  • A recession in Japan
  • Various scandals and lawsuits
  • The Federal Reserve raised interest rates
  • Sept 11th

In response to the crash, the Federal Reserve (the US central bank) stepped in and lowered the interest rate. To put it simply, this makes money cheaper to borrow and is a way to get money flowing again.

Some of these events were under the US government’s control; some of them were quite the opposite. The advantage of a central authority is that someone can intervene when a financial meltdown occurs and prevent the situation from getting worse for everyone. My perspective was, and still is, that any economy needs to have safeguards built in.

I think it would be amazing to see monetary policy in an automated system. That way policy would not be set behind closed doors anymore. But it seems the major cryptocurrencies have not taken this step yet. And I think that is a mistake. Case in point…

The Crypto Crash

Late 2017 was shocking; the cryptocurrency crash happened almost as suddenly as the rise. In February 2018, it finally bottomed out and that is where I saw something interesting. Just interesting enough to make me climb down from my high horse and take a closer look: Bitcoin was up significantly year-over-year. And that’s what got my attention.

Someone, somewhere out there finds enough value in this to ride out the crash. So I went in search of their perspective. What did they see that I didn’t?

How I Was Won Over

A friend gave me a book called “Bitcoin and the Future of Money” by Jose Pagliery as a gift. Who doesn’t love the story about the mysterious origins of Bitcoin and the technology behind Blockchain?

The chapter that won me over was Chapter 6: The Case for Bitcoin. The majority of this chapter talked about the life experiences of being “unbanked” and “underbanked”.

  • “unbanked”: being without a bank account
  • “underbanked”: having a bank account but being better served by alternative financial services like check-cashing and payday-loan stores, which are notoriously expensive.

Pagliery’s figures on how many people are un/underbanked were staggering. At the time the book was published in 2014, more than 25% of American households and 30% of citizens in large American cities were in this situation. For many of us, transactions such as depositing a check, getting cash, or receiving a loan are seamless and relatively low cost. However they become expensive at lower incomes and can create a vicious cycle.

  • Depositing a Check: Banks often charge fees if an account is below a minimum balance. So an account is not practical for those who cannot maintain this amount. When people in this situation receive a paycheck, their best option is to go to a check cashing service which also charges sizable fees.
  • Getting Cash: Debit cards and vendors often won’t process small transaction amounts so having cash is still a necessity in order to stick to a low income budget. If you receive a government assistance card and you need to get cash, the ATM will slap you with fees.
  • Receiving a Loan: Say you miss work due to illness; a payday loan is the only service that will lend the small amount you need, but with hefty interest rates and other fees.

Perhaps some of this could be mitigated by better financial planning and education. But ultimately people in this situation are paying large percentages of their income to use their own money. It is undeniable that our financial system is not set up to effectively process small transactions and service those with low income.

Thank you to Omer Goldberg for the use of his photo displaying M-PESA signs at a local pharmacy in Kenya

The chapter went on to describe the breakthrough of M-PESA in Kenya, which solves exactly this problem. M-PESA is a system that belongs to Kenya’s largest mobile network company and allows users to deposit money into a mobile account and then text money to each other. According to this chapter, 43% of Kenya’s GDP passed through M-PESA. This system was so successful because the number of Kenyans who have a phone far surpasses the number who have a formal bank account.

I started to look around for the signs of underbanking at home. I live in Israel. Here the banks often require a minimum down payment of 25% (or more) on a home. By comparison, in the US it can be as low as 3%. This 25% minimum down payment, in conjunction with high real estate prices, puts homeownership out of reach for many people and prevents the accumulation of wealth, because paying rent is just a black hole. This is only part of the story with the banks and real estate here. I can only imagine what banking services look like for other places in the Middle East.

This chapter won me over because I was able to imagine the impact of an easy-to-join system that provides seamless and inexpensive financial services for everyone. Can you imagine what our worldwide economy would look like if that were so?

Looking Forward

I think an important step to achieve this is an automatic monetary policy mechanism to stabilize the exchange rate and other economic indicators. This would allow cryptocurrencies to take on more of a currency role instead of a speculative investment.

I am not alone in thinking this. A friend introduced me to The Dai Stablecoin System, which has implemented an automated monetary policy system that stabilizes the Dai Stablecoin against the US Dollar. Pegging a coin to the US Dollar is a great way to start off the transition to a blockchain system. But my hope is that one day these currencies will be stable and floating (allowing supply and demand to set the exchange rate) in their own right.

A lot of brainstorming, debate, and experimentation needs to be done to get us to that point. I look forward to seeing what new crypto-creations come out in the future and where the discussion takes us.

Thanks for reading! If you have any resources you think are interesting related to cryptocurrencies, please feel free to comment or reach out. And if you have questions feel free to comment & I will try to get back to you.

Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon


Fundamental Python Data Science Libraries: A Cheatsheet ~ Scikit-Learn

Originally Published on Hackernoon

If you are a developer who wants to integrate data manipulation or science into your product, or you are starting your journey in data science, here are the Python libraries you need to know.

  1. NumPy
  2. Pandas
  3. Matplotlib
  4. Scikit-Learn

The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.

Scikit-Learn

Scikit-Learn is built on top of NumPy, SciPy, and matplotlib. It contains an extensive collection of ready-to-use Machine Learning algorithms. All the algorithms are well documented and easy to use for all experience levels.

Focus of the Library

This library contains many powerful algorithms; each is its own object with certain parameters and methods.

Installation

Open a command line and type in
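For example, with pip (the PyPI package is named scikit-learn):

```
pip install scikit-learn
```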

Windows: in the past I have found installing NumPy & other scientific packages to be a headache, so I encourage all you Windows users to download Anaconda’s distribution of Python which already comes with all the mathematical and scientific libraries installed.

Details

As you can see on their homepage, Scikit-Learn is split into a few subjects: Classification, Regression, Clustering, Dimensionality Reduction, Model Selection, and Preprocessing.

The first three subjects (Classification, Regression, and Clustering) are all types of algorithms, while the last three (Dimensionality Reduction, Model Selection, and Preprocessing) alter and analyze your data so it works better when fed to an algorithm. This article focuses on the first three subjects.

Supervised Learning

Both Classification and Regression algorithms fall under a section of Machine Learning called “Supervised Learning”. What this means is that these two types of algorithms have something in common: the data fed into them contains both observational data and targets (also called results or ‘answers’).

A standard example of this is home data and their sale prices. The observational data is the features of each home. The target is the price the home got on the market.

We then can use Supervised Learning to predict the price of any home as long as we have features about it.

  • Classification data means the target values are discrete (such as labeling a house as either expensive or inexpensive)
  • Regression means the results are continuous (the actual price of the home in dollars which could be any positive number).

Unsupervised Learning

The Clustering section is also known as “Unsupervised Learning”. This means that we have observational data but no targets. Instead we want to use an algorithm to find groups inside the observational data and create labels. A common example is user segmentation in sales or traffic data.

Creation

First, you need to examine if your data fits a Classification, Regression, or Clustering scenario. Then go to the corresponding section of Scikit-Learn and select an algorithm. As mentioned before, each algorithm is its own object.

We will start with a very simple example in order to get a feel for the library: Linear Regression.

We are going to use one of Scikit-Learn’s built-in datasets as a walk-through in creating, fitting, and predicting using this model.

Note: this is not intended to be an in-depth analysis of Linear Regression, just a quick example.

Normally, you will initialize the algorithm object with the parameters you want. There are defaults for many of them, just like we see here. However, I recommend researching what each of the parameters means to make sure you have made the right choice.

We are going to put the data into a pandas DataFrame to make separating the data into training and testing sets straightforward.

Now we are ready to fit the model.

.fit Method

I’m going to take the training data and put it into the Linear Regression algorithm by using the .fit method. This method will calculate the underlying linear equation that best fits the data.

That’s it! Nice, so how did we do?

.predict Method

In order to evaluate how well our algorithm is able to make a prediction based only on observational data, we use the .predict method. We will use the test set of observational data to make predictions.

Then we see how those predictions compare to the actual targets by looking at the R² (coefficient of determination) and MSE (mean squared error) metrics.

We will skip over checking the plot of the residuals and just look at these metrics. The metrics tell us this model is OK. We were able to explain about 70% of the variance in the target with our model. If we were to run this model with different combinations of columns in our observational data, the mean squared error metric would help us compare between models.
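The code screenshots from the original post are not reproduced here, so below is a minimal end-to-end sketch of the steps above. It uses the built-in diabetes dataset as a stand-in for whichever built-in dataset the original walkthrough used, so the exact metric values will differ:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load a built-in dataset (a stand-in for the one used in the original post).
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Split the observational data and the targets into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    df[data.feature_names], df["target"], test_size=0.3, random_state=42
)

# Initialize the algorithm object (default parameters here) and fit it.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict from the held-out observational data and compare to the actual targets.
predictions = model.predict(X_test)
print("R2: ", r2_score(y_test, predictions))
print("MSE:", mean_squared_error(y_test, predictions))
```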

Applications

Cool! So you have seen that this library has algorithm objects and each will have a fit method. If the algorithm you are using is a Regression or Classification algorithm it will also have a predict method.

Each algorithm may differ so be sure to read the documentation.

Let’s apply the walkthrough we just did to our real-life Bitcoin scenario. In my article on pandas, we acquired data on Bitcoin and created a signal for when Bitcoin’s price had dipped below its rolling 30-day average. In my last article, we used matplotlib to plot the data.

Say we are cautious investors and watching the 30-day rolling average is not a good enough analysis for us. Is there a better way to examine market behavior?

I once came across a financial product that utilized Clustering to help traders visualize groups of similar market behavior. I thought this was intriguing because building accurate models using Supervised Learning often requires data out of the average person’s reach. Clustering allows anyone to find patterns in what they have available.

So let’s see if we can leverage Clustering to discover patterns in Bitcoin prices. I procured a small Bitcoin dataset from Quandl (you’ll need an account). This dataset includes about 3 months’ worth of Bitcoin prices. I picked this dataset because it has market open and close prices.

Let’s see if there are groups of market highs and lows based on the Bitcoin price when the market opens. If yes, perhaps we have a market strategy on our hands!

Here are the libraries we need:
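The import list isn’t shown here; based on the steps that follow, it would look something like this:

```python
import pandas as pd
import quandl
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
```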

Here’s my code to set up the data:
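The original setup code isn’t reproduced here. A rough sketch, with a placeholder Quandl dataset code for a daily Bitcoin series that has Open/High/Low columns and an arbitrary three-month window:

```python
quandl.ApiConfig.api_key = "YOUR_API_KEY"  # placeholder

# Placeholder dataset code and dates -- any daily Bitcoin series with
# Open / High / Low columns over roughly 3 months will do.
df = quandl.get("BCHARTS/BITSTAMPUSD", start_date="2018-05-01", end_date="2018-08-01")
df = df[["Open", "High", "Low"]]
```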

Here is a function for making the visualizations we need to examine the data:
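And a small plotting helper, a simplified stand-in for the original function (it optionally colors points by cluster label so we can reuse it after clustering):

```python
def plot_prices(df, x_col, y_col, labels=None):
    """Scatter y_col against x_col, optionally colored by cluster label."""
    fig, ax = plt.subplots(figsize=(8, 5))
    if labels is None:
        ax.scatter(df[x_col], df[y_col])
    else:
        ax.scatter(df[x_col], df[y_col], c=labels, cmap="viridis")
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title(f"{x_col} vs {y_col}")
    plt.show()


plot_prices(df, "Open", "High")
plot_prices(df, "Open", "Low")
```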

Here’s what our data looks like:

Now that’s a linear relationship if I’ve ever seen one. But my sense is that quite a few other data points would be needed to build a super accurate Regression model with this data, so let’s stick to what a Clustering algorithm can tell us.

Scikit-Learn has many Clustering algorithms available. We will use DBSCAN because we don’t know how many clusters there should be and we want to focus on areas with a concentration of data points. There are other clustering algorithms that you could use based on how you want to structure your strategy.

Here’s the code for building a Clustering algorithm for both Open vs High and Open vs Low data.

Epsilon (eps): the maximum distance between two data points for them to still be considered part of the same cluster. I picked $150 for this value.
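The original snippet isn’t reproduced here; a rough sketch of the idea, assuming the df from the setup sketch above (min_samples is my own choice, and eps=150 matches the $150 mentioned above):

```python
open_high = df[["Open", "High"]].values
open_low = df[["Open", "Low"]].values

# eps: max distance between two points for them to be neighbors ($150 here).
# min_samples: minimum neighbors needed to form a dense region (assumed value).
high_clusters = DBSCAN(eps=150, min_samples=3).fit(open_high)
low_clusters = DBSCAN(eps=150, min_samples=3).fit(open_low)

# Label each row with its cluster; -1 marks the outliers (the black dots).
df["high_cluster"] = high_clusters.labels_
df["low_cluster"] = low_clusters.labels_

plot_prices(df, "Open", "High", labels=df["high_cluster"])
plot_prices(df, "Open", "Low", labels=df["low_cluster"])
```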

So what did we find?

Our clusters are groups of similar market behavior. The black dots are outliers that do not belong to a cluster.

Let’s look at each cluster and find the average difference between the High/Low price and Open price.
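One way to compute those per-cluster averages, continuing the sketch above (the -1 outliers are excluded):

```python
# Average gain from Open to High within each cluster.
high_gain = (
    df[df["high_cluster"] != -1]
    .assign(diff=lambda d: d["High"] - d["Open"])
    .groupby("high_cluster")["diff"]
    .mean()
)

# Average drop from Open to Low within each cluster.
low_drop = (
    df[df["low_cluster"] != -1]
    .assign(diff=lambda d: d["Low"] - d["Open"])
    .groupby("low_cluster")["diff"]
    .mean()
)

print(high_gain)
print(low_drop)
```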

The results above tell us that if the market opens one day with a Bitcoin price around $6,500, similar data points saw an average High of +$139 and an average Low of -$113 relative to the Open during the day.

So now what? The next step is to put what we learned into a system that tests and executes trading strategies automatically. Next steps are on you! Good luck!

Thanks for reading! If you have questions feel free to comment & I will try to get back to you.

Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon


Fundamental Python Data Science Libraries: A Cheatsheet ~ Matplotlib

Originally published on Hackernoon

If you are a developer who wants to integrate data manipulation or science into your product, or you are starting your journey in data science, here are the Python libraries you need to know.

  1. NumPy
  2. Pandas
  3. Matplotlib
  4. Scikit-Learn

The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.

Matplotlib

This library is the go-to Python visualization package (except for Plotly which is paid)! It allows you to create rich images displaying your data with Python code.

Focus of the Library

This library is extensive, but this article will focus on two objects: the Figure and the Axes.

Installation

Open a command line and type in
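For example:

```
pip install matplotlib
```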

Windows: in the past I have found installing NumPy & other scientific packages to be a headache, so I encourage all you Windows users to download Anaconda’s distribution of Python which already comes with all the mathematical and scientific libraries installed.

Details

Matplotlib is split into two main sections: the Pyplot API (visualization functions for fast production) and the Object Oriented API (more flexible and robust).

We will focus on the latter.

Let’s dive in!

Creation

In order to make a visualization, you need to create two objects, one right after the other. First create a Figure object and then, from that, create an Axes object. After that, all visualization details are created by calling methods; there is a short sketch after the notes below.

Some things to note about the Figure object:

  • The figsize & dpi parameters are optional
  • figsize is the width and height of the figure in inches
  • dpi is the dots per inch (pixels per inch)

Some things to note about the add_axes method:

  • The position of the axes can only be specified in fractions of the figure size
  • There are many other parameters that you can pass to this method
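A minimal sketch of that creation step (the figsize, dpi, and position values here are just examples):

```python
import matplotlib.pyplot as plt

# Create the Figure, then attach an Axes at [left, bottom, width, height],
# all given as fractions of the figure size.
fig = plt.figure(figsize=(8, 4), dpi=100)
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
```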

Plotting

Now we are going to create some simple data, plot it, label the graph, and save it to the same directory as where our code lives.
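The original snippet is a screenshot; a sketch of those steps might look like this (the data, labels, and file name are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

# Some simple data to plot.
x = np.linspace(0, 5, 50)
y = x ** 2

fig = plt.figure(figsize=(8, 4), dpi=100)
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

# Plot and label the graph; the label keyword is what the legend will use later.
ax.plot(x, y, label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple plot")

# Save next to where the code lives.
fig.savefig("simple_plot.png")
```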

Here is the resulting image:

Legends

The best way to add a legend is to include the label keyword when you call the plot method on the Axes object (as we saw in the code above). Then you can make a legend and choose its location by calling another method.
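That other method is legend(); for example:

```python
ax.legend(loc="upper left")  # loc can also be "best", "lower right", etc.
fig.savefig("simple_plot_with_legend.png")
```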

Here is the resulting image:

Colors & Lines

You can control features of the lines by passing certain keyword arguments into the plot method. Some of the most commonly used keywords are listed below, with a short sketch after the list:

  • color: either passing the name (“b”, “blue”, “r”, “red”, etc.) or a hex code (“#1155dd”, “#15cc55”)
  • alpha: transparency of the line
  • linewidth
  • linestyle: pattern of the line (‘-’, ‘-.’, ‘:’, ‘steps’)
  • marker: pattern for each data point on the line (‘+’, ‘o’, ‘*’, ‘s’, ‘,’, ‘.’)
  • markersize
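For example (the values here are picked arbitrarily):

```python
ax.plot(
    x, y,
    color="#1155dd",   # named colors like "red" or "b" work too
    alpha=0.7,
    linewidth=2,
    linestyle="-.",
    marker="o",
    markersize=5,
)
```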

Here is the resulting image:

Axes Range & Tick Marks

You can also control the range of the axes and override the tick lines of your graph.
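A sketch of those methods (the ranges and tick values are arbitrary):

```python
ax.set_xlim([0, 5])
ax.set_ylim([0, 30])
ax.set_xticks([0, 1, 2, 3, 4, 5])
ax.set_xticklabels(["zero", "one", "two", "three", "four", "five"])
```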

Here is the resulting image:

Subplots

So far we have created a Figure object with only one graph on it. It is possible to create multiple graphs on one Figure all in one go. We can do this using the subplots function.
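A sketch using subplots (a 1x2 grid here):

```python
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# axes is an array of Axes objects; each one works exactly like before.
axes[0].plot(x, y)
axes[0].set_title("y = x^2")
axes[1].plot(x, np.sqrt(x))
axes[1].set_title("y = sqrt(x)")

fig.tight_layout()
fig.savefig("subplots.png")
```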

Here is the resulting image:

I’m providing here a link to download my Matplotlib walkthrough using a Jupyter Notebook!

Never used Jupyter notebooks before? Visit their website here.

Applications

In my last article on pandas, we acquired data on Bitcoin and created a signal for when to buy and trade based on the rolling 30 day average price. We can use our new knowledge in Matplotlib to visualize this data.

You’ll need a Quandl account and the Python Quandl library.

Code from last time:
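That snippet isn’t reproduced here; roughly, it pulled a daily Bitcoin price series from Quandl and added a 30-day rolling average (the dataset code and column name below are placeholders):

```python
import quandl

quandl.ApiConfig.api_key = "YOUR_API_KEY"  # placeholder

# Placeholder dataset code for a daily Bitcoin price series with a "Value" column.
btc = quandl.get("BCHAIN/MKPRU")
btc["rolling_30d"] = btc["Value"].rolling(window=30).mean()
btc["buy_signal"] = btc["Value"] < btc["rolling_30d"]
```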

New code to visualize bitcoin data:
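A sketch of the plotting step, continuing from the DataFrame above:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(btc.index, btc["Value"], label="Bitcoin price")
ax.plot(btc.index, btc["rolling_30d"], label="30 day rolling average")
ax.set_xlabel("Date")
ax.set_ylabel("Price (USD)")
ax.set_title("Bitcoin price vs. 30 day rolling average")
ax.legend(loc="upper left")
fig.savefig("bitcoin_rolling_average.png")
```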

Here is the resulting image:

That’s Matplotlib! Fast, flexible, and easy visualizations with real data. But what if we wanted to analyze the data with something more sophisticated than a rolling 30 day average? The last library every Python data-oriented programmer needs to know is Scikit-Learn — learn about it in my next article!

Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon


Fundamental Python Data Science Libraries: A Cheatsheet ~ Pandas

Originally published on Hackernoon

If you are a developer who wants to integrate data manipulation or science into your product, or you are starting your journey in data science, here are the Python libraries you need to know.

  1. NumPy
  2. Pandas
  3. Matplotlib
  4. Scikit-Learn

The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.

pandas

This library is built on top of NumPy, which you may remember from my last article. Pandas takes NumPy’s powerful mathematical array-magic one step further. It allows you to store & manipulate data in a relational table structure.

Focus of the Library

This library focuses on two objects: the Series (1D) and the DataFrame (2D). Each allow you to set:

  • an index — that lets you find and manipulate certain rows
  • column names — that let you find and manipulate certain columns

Having SQL deja-vu yet?

Installation

Open a command line and type in
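For example:

```
pip install pandas
```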

Windows: in the past I have found installing NumPy & other scientific packages to be a headache, so I encourage all you Windows users to download Anaconda’s distribution of Python which already comes with all the mathematical and scientific libraries installed.

Details

A pandas data structure differs from a NumPy array in a couple of ways:

  1. All data in a NumPy array must be of the same data type, while a pandas data structure can hold multiple data types
  2. A pandas data structure allows you to name rows and columns
  3. NumPy arrays can reach multiple dimensions, while pandas data structures limit you to just 1D & 2D.*

*There is a 3D pandas data structure called a Panel, but it is deprecated

Let’s dive in!

Creation

It’s very simple!

You can create a Series or DataFrame from a list, tuple, NumPy array, or even a dictionary! Oh and of course from CSVs and databases.
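The original print-outs are screenshots; here is a sketch that produces output like what’s described below (the values are random):

```python
import numpy as np
import pandas as pd

# A Series from a list: prints an index column on the left, the data on the right.
s = pd.Series([10, 20, 30, 40])
print(s)

# A DataFrame from a 2D NumPy array: 5 rows, 6 columns named 0 through 5 for now.
df = pd.DataFrame(np.random.randn(5, 6))
print(df)
```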

The printout you see above has two columns. The one on the left is the index and the one on the right is your data. This index looks like the indexes we are used to when using lists, tuples, arrays, or any other iterable. We will soon see that in pandas we can change it to whatever we like!

The printout you see above has a ton of numbers. The first column on the left is the index. The top row holds the column names (for now just 0 through 5). Again, we will soon see that in pandas we can change them to whatever we like!

From a Dictionary

The dictionary keys will become the index in a Series

It works a bit differently in a DataFrame — the keys become the column names
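A sketch of both cases (the keys and values are made up):

```python
# Series: the dictionary keys become the index.
s = pd.Series({"a": 1, "b": 2, "c": 3})

# DataFrame: the dictionary keys become the column names.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
```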

Upload data

Pandas has many ways to upload data, but let’s focus on the standard csv format.

The keyword argument index_col is where you specify which column in your CSV should become the index of the DataFrame. For more details on the read_csv function, go here.
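For example, with a hypothetical file name:

```python
df = pd.read_csv("my_data.csv", index_col=0)  # column 0 of the CSV becomes the index
```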

I love that the pandas library only requires 1 line to import data from a CSV. Who else is over copying and pasting the same lines of code from the csv library? 😉

Use the Index

Your days of text wrangling are over! No more weird list comprehensions or for loops with comments like “# extract this column during given period” or “# sorry for the mess”.

Here is an example DataFrame:

Indexing a Column

Indexing a Row

Indexing multiple axes – names

Indexing multiple axes – numbers
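The example DataFrame and the indexing screenshots aren’t reproduced here; a sketch of all four cases on a small made-up DataFrame:

```python
df = pd.DataFrame(
    {"price": [100, 200, 300], "volume": [10, 20, 30]},
    index=["a", "b", "c"],
)

df["price"]                  # indexing a column
df.loc["b"]                  # indexing a row by its index label
df.loc[["a", "b"], "price"]  # multiple axes, by names
df.iloc[0:2, 0]              # multiple axes, by numbers (positions)
```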

View Your Data

Quickly check the top and bottom rows:

View summary statistics before you dash off for a meeting:
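For example:

```python
df.head()      # first 5 rows
df.tail(3)     # last 3 rows
df.describe()  # count, mean, std, min, quartiles, and max for each numeric column
```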

Control Your Data

Pandas brings the flexibility of SQL into Python.

Sort

Join

Here are new example DataFrames:

If you want to join on a column other than the index, check out the merge method.

Group by
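The original examples are screenshots; a sketch of all three operations (plus adding a column, which is referenced in the next section) on small made-up DataFrames:

```python
left = pd.DataFrame({"price": [3, 1, 2]}, index=["a", "b", "c"])
right = pd.DataFrame({"volume": [30, 10, 20]}, index=["a", "b", "c"])

# Sort by a column's values.
left.sort_values("price")

# Join on the index (use merge to join on another column instead).
combined = left.join(right)

# Add a new column with key/value notation, then group by and aggregate, SQL-style.
combined["notional"] = combined["price"] * combined["volume"]
combined.groupby("price")["notional"].sum()
```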

Accessing Attributes

Notice how I was able to just add in a column using a key/value notation in the code above? Pandas allows you to add new data with ease. But it also allows you to access the core attributes of your data structures.

Access the Index

Access the Values

Access the Columns
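For example, continuing the DataFrame above:

```python
combined.index    # the row labels
combined.values   # the underlying NumPy array
combined.columns  # the column names
```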

I’m providing here a link to download my pandas walkthrough using a Jupyter Notebook!

Never used Jupyter notebooks before? Visit their website here.

Overall, if you have a dataset you want to manipulate but don’t want the hassle of hauling it all into SQL, I recommend searching for a pandas solution before anything else!

Applications

Let’s look at a scenario. Say you wanted to keep an eye on Bitcoin but don’t want to invest too much time in building out an infrastructure. You can use pandas to keep it simple.

You’ll need a Quandl account and the Python Quandl library.

Let’s code:
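The original code isn’t reproduced here; a rough sketch of the idea, with a placeholder Quandl dataset code:

```python
import quandl
import pandas as pd

quandl.ApiConfig.api_key = "YOUR_API_KEY"  # placeholder

# Placeholder dataset code for a daily Bitcoin price series with a "Value" column.
btc = quandl.get("BCHAIN/MKPRU")

# 30 day rolling average and a simple "price dipped below the average" signal.
btc["rolling_30d"] = btc["Value"].rolling(window=30).mean()
btc["buy_signal"] = btc["Value"] < btc["rolling_30d"]

print(btc.tail())
```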

This is the power of pandas with real life data! However, what if we wanted to view the data shown above in a graph? That’s possible, check out my next article on Matplotlib.

Thanks for reading! If you have questions feel free to comment & I will try to get back to you.

Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon


Fundamental Python Data Science Libraries: A Cheatsheet ~ Numpy

Originally published on Hackernoon

If you are a developer who wants to integrate data manipulation or science into your product, or you are starting your journey in data science, here are the Python libraries you need to know.

  1. NumPy
  2. Pandas
  3. Matplotlib
  4. Scikit-Learn

The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.

NumPy

Just as it is written on NumPy’s website, this library is fundamental for scientific computing in Python. It includes powerful manipulation and mathematical functionality at super fast speeds.

Focus of the Library

This library is all about the multidimensional array. It is similar in appearance to a list & indexes like a list, but carries a much more powerful set of tools.

Installation

Open a command line and type in:
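For example:

```
pip install numpy
```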

Windows: in the past I have found installing NumPy to be a headache, so I encourage all you Windows users to download Anaconda’s distribution of Python which already comes with all the mathematical and scientific libraries installed.

Details

A NumPy array differs from a list in a couple of ways.

  1. All data in a NumPy array must be of the same data type, while a list can hold multiple types
  2. A NumPy array is more memory efficient & faster! See a detailed explanation here
  3. Lists don’t have the powerful mathematical methods and attributes that NumPy arrays have built in, which are super useful for data exploration and development.

Let’s dive in!

Creation

You can create an array in a couple of different ways.

From a list or tuple

With placeholder content

With a sequence

Upload data
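The original snippets are screenshots; a sketch of each creation method above (the file name in the last line is hypothetical):

```python
import numpy as np

# From a list or tuple
a = np.array([1, 2, 3, 4])

# With placeholder content
zeros = np.zeros((2, 3))
ones = np.ones(5)

# With a sequence
seq = np.arange(0, 10, 2)     # 0, 2, 4, 6, 8
grid = np.linspace(0, 1, 5)   # 5 evenly spaced points from 0 to 1

# Upload data from a text/CSV file
data = np.genfromtxt("prices.csv", delimiter=",")
```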

Makes Math Easy

You can do all sorts of mathematical operations on the whole array. No looping required! A new array will be made with the results.
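For example:

```python
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

a + b        # element-wise addition, no loop required
a * 2        # multiply every element by 2
a ** 2       # square every element
np.sqrt(b)   # element-wise square root
```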

Attributes & Methods

Beyond just mathematical operations, NumPy comes with a plethora of powerful functionality that you can leverage to save yourself time & increase readability.

Summary Statistics

Additionally, there are .max(), .min(), .sum(), and plenty more.

Reshape

More Math

There are many more (too many to list) mathematical methods available. Dot is just my favorite.
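Sketches of the summary statistics, reshape, and dot examples described above:

```python
a = np.arange(1, 7)   # array([1, 2, 3, 4, 5, 6])

# Summary statistics
a.mean(), a.std(), a.max(), a.min(), a.sum()

# Reshape into 2 rows x 3 columns
m = a.reshape(2, 3)

# More math: the dot product
m.dot(m.T)            # equivalently, np.dot(m, m.T)
```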

I’m providing here a link to download my NumPy walkthrough using a Jupyter Notebook for everything we covered and more!

Never used Jupyter notebooks before? Visit their website here.

Overall, if you have complex transformations you need to do on lists of data, I recommend searching for a NumPy solution before coding something yourself. This will save you many a headache.

Applications

Let’s look at a scenario. Say I was able to export trading transactions: buys & sells. I want to see how much cash I had on hand after each transaction.
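A sketch with made-up amounts (buys are negative cash flows, sells positive):

```python
import numpy as np

# Fictional transactions: negative = cash spent on a buy, positive = cash from a sell.
transactions = np.array([-50, -25, 100, -30, 80])
starting_cash = 1000

# Running cash balance after each transaction.
cash_on_hand = starting_cash + np.cumsum(transactions)
print(cash_on_hand)
```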

This is a version with very simple, fictional data. However, what if we wanted to work with the data shown above but with the dates next to them? That’s possible, check out my next article on pandas.

Thanks for reading! If you have questions feel free to comment & I will try to get back to you.


Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon


Mastering Python Web Scraping: Get Your Data Back

Do you ever find yourself in a situation where you need to get information out of a website that conveniently doesn’t have an export option?

This happened to a client of mine who desperately needed lists of email addresses from a platform that did not allow you to export your own data and hid the data behind a series of UI hurdles. This client was about to pay through the nose for a data-entry worker to copy each email out by hand. Luckily, she remembered that web scraping is the way of the future and happens to be one of my favorite ways to rebel against “big brother”. I hacked something out fast (15 minutes) and saved her a lot of money. I know others out there face similar issues. So I wanted to share how to write a program that uses the web browser like you would and takes (back) the data!

We will practice this together with a simple example: scraping a Google search. Sorry, not very creative 🙂 But it’s a good way to start.

Requirements

Python (I use 2.7)

  • Splinter (based on Selenium)
  • Pandas

Chrome & Chromedriver

If you don’t have Pandas and are lazy, I recommend heading over to Anaconda to get their distribution of Python that includes this essential & super useful library.

Otherwise, download it and all of its dependencies with pip from the terminal/command line.
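For example:

```
pip install pandas
```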

If you don’t have Splinter (and are not using Anaconda’s Python), simply download it with pip from the terminal/command line.
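For example:

```
pip install splinter
```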

If you want to set this up in a virtual environment (which has many advantages) but don’t know where to start, try reading our other blog post about virtual environments.

Step 1: The Libraries & Browser

Here we will import all the libraries we need and set up a browser object.
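Those snippets aren’t reproduced here; a sketch of the setup:

```python
import pandas as pd
from splinter import Browser

# Open a Chrome browser controlled from Python (requires chromedriver on your PATH).
browser = Browser("chrome")
```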

If the page you are trying to scrape is responsive, use set_window_size to ensure all the elements you need are displayed.
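For example (the window size is arbitrary):

```python
browser.driver.set_window_size(1024, 768)
```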

The code above will open a Google Chrome browser. Now that the browser is all set up, let’s visit Google.
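For example:

```python
browser.visit("https://www.google.com")
```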

Step 2: Explore the Website

Great, so far we have made it to the front page. Now we need to focus on how to navigate the website. There are two main steps to achieving this:

  1. Find something (an HTML element)
  2. Perform an action on it

To find an HTML element you need to use the Chrome developer tools. Right click on the website and select “Inspect”. This will open a box on the right side of the Chrome browser. Then click on the inspect icon (highlighted in red).

Next use the inspector cursor to click on a section of the website that you want to control. When you have clicked, the HTML that creates that section will be highlighted on the right. In the photo below, I have clicked on the search bar which is an input.

Next, right click on the HTML element and select “Copy” -> “Copy XPath”

Congrats! You’ve now got the keys to the kingdom. Let’s move on to how to use Splinter to control that HTML element from Python.

Step 3: Control the Website

That XPath is the most important piece of information! First, keep this XPath safe by pasting it into a variable in Python.

Next we will pass this XPath to a great method from the Splinter Browser object: find_by_xpath(). This method will extract all the elements that match the XPath you pass it and return a list of Element objects. If there is only one element, it will return a list of length 1. There are other methods such as find_by_tag(), find_by_name(), find_by_text(), etc.

The code above now gives you navigation of this individual HTML element. There are two useful methods I use for crawling: fill() and click(). A sketch of the whole sequence is below.
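The XPaths in the sketch are placeholders, so paste in the ones you copied from the Chrome inspector:

```python
# Placeholder XPaths -- replace with the ones copied via "Copy XPath".
search_bar_xpath = "PASTE_SEARCH_BAR_XPATH_HERE"
search_button_xpath = "PASTE_SEARCH_BUTTON_XPATH_HERE"

# find_by_xpath returns a list of matching elements; take the first one.
search_bar = browser.find_by_xpath(search_bar_xpath)[0]
search_bar.fill("CodingStartups.com")

search_button = browser.find_by_xpath(search_button_xpath)[0]
search_button.click()
```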

The code above types CodingStartups.com into the search bar and clicks the search button. Once you execute the last line, you will be brought to the search results page!

Tip: Use fill() and click() to navigate login pages 😉

Step 4: Scrape!

For the purpose of this exercise, we will scrape off the titles and links for each search result on the first page.

Notice that each search result is stored within an h3-tag with a class “r”. Also take note that both the title and the link we want are stored within an a-tag.

The XPath of that highlighted a tag is:

//*[@id="rso"]/div/div/div[1]/div/div/h3/a

But this is just the first link. We want all of the links on the search page, not just the first one. So we are going to change this a bit to make sure our find_by_xpath method returns all of the search results in a list. Here is how to do it:
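Something like this, assuming the result markup the next paragraph describes (h3 tags with class “r”, which Google used at the time):

```python
search_results = browser.find_by_xpath('//h3[@class="r"]/a')
```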

This XPath tells Python to look for all h3-tags with a class “r”. Then inside each of them, extract the a-tag & all its data.

Now, let’s iterate through the search result link elements that the find_by_xpath method returned. We will extract the title and link for each search result. It’s very simple:
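For example:

```python
scraped_data = []
for search_result in search_results:
    title = search_result.text       # the visible link text
    link = search_result["href"]     # the href attribute of the a-tag
    scraped_data.append((title, link))
```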

Cleaning the data in search_result.text can sometimes be the most frustrating part. Text on the web is very messy. Here are some helpful methods for cleaning data:

.replace()

.encode()

.strip()

All of the titles and links are now in the scraped_data list. Now let’s export our data to a CSV. Instead of the csv library chaos, I like to use a pandas DataFrame. It’s two lines:
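Roughly:

```python
df = pd.DataFrame(scraped_data, columns=["Title", "Link"])
df.to_csv("search_results.csv", index=False)  # hypothetical output file name
```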

The code above creates a csv file with the headers Title, Link and then all of the data that was in the scraped_data list. Congrats! Now go forth and take (back) the data!

In case you want a big picture view, here is the full code available on our GitHub account.

Thanks for reading! If you have questions feel free to comment & I will try to get back to you.

Connect with me on Instagram @lauren__glass & LinkedIn

Check out my essentials list on Amazon
