Turning Machine Learning Models into APIs in Python
Learn how to create a simple API from a machine learning model in Python using Flask.
Consider the following situation:
You have built a super cool machine learning model that can predict whether a particular transaction is fraudulent or not. Now, a friend of yours is developing an Android application for general banking activities and wants to integrate your machine learning model into their application.
But your friend finds out that you have coded your model in Python, while the application is being built in Java. Does that mean it's impossible to integrate your machine learning model into your friend's application?
Fortunately, you have the power of APIs. The situation above is one of many where turning your machine learning model into an API becomes extremely important. Many industries are now looking for Data Scientists who can do this. Wrapping a machine learning model into an API is not very difficult, and that is precisely what you will do in this tutorial: turn your machine learning model into an API.
Specifically, you will be covering the following:
Options to implement machine learning models
What are APIs?
Flask basics
Creating a machine learning model
Saving the machine learning model: Serialization & Deserialization
Creating an API from a machine learning model using Flask
Testing your API in Postman
Options to implement Machine Learning models
Most of the time, the real use of a machine learning model lies at the heart of an intelligent product – it may be a small component of a recommender system or an intelligent chat-bot. These are the times when the barriers seem very difficult to overcome.
For example, the majority of ML practitioners use R or Python for their experiments, but the consumers of those models are often software engineers who use a completely different technology stack. There are two ways this problem can be solved:
Rewriting the whole code in the language that the software engineering folks work in. This might seem like a good idea, but the time and energy required to replicate those intricate models would be an utter waste. Most languages, such as JavaScript, do not have great libraries for ML. One would be wise to stay away from this approach.
API-first approach – Web APIs have made it easy for cross-language applications to work well together. If a frontend developer needs to use your ML model to create an ML-powered web application, they only need the URL endpoint where the API is being served.
Now, before going any further let's study what really is an API.
What are APIs?
"In simple words, an API is a (hypothetical) contract between 2 softwares saying if the user software provides input in a pre-defined format, the latter will extend its functionality and provide the outcome to the user software." - Analytics Vidhya
You can read further articles on the topic to understand why APIs are such a popular choice among developers.
Essentially, APIs are very much like web applications, but instead of giving you a nicely styled HTML page, they tend to return data in a standard data-exchange format such as JSON or XML. Once a developer has the desired output, they can style it however they want. There are many popular ML APIs as well, for example IBM Watson's ML API, which is capable of the following:
Machine Translation - Helps translate text in different language pairs.
Message Resonance – To find out the popularity of a phrase or word with a predetermined audience.
Question and Answers - This service provides direct answers to the queries that are triggered by primary document sources.
User Modelling – To make predictions about social characteristics of someone from a given text.
Google Vision API is also an excellent example which provides dedicated services for Computer Vision tasks. Click here to get an idea of what can be done using Google Vision API.
Essentially, most cloud providers and smaller machine-learning-focused companies provide ready-to-use APIs. They cater to the needs of developers and businesses that do not have ML expertise but want to implement ML in their processes or product suites.
Popular examples of machine learning APIs and tools suited explicitly for web development are DialogFlow, Microsoft's Cognitive Toolkit, TensorFlow.js, etc.
Now that you have a fair idea of what APIs are, let's see how you can wrap a machine learning model (developed in Python) into an API in Python.
Flask - a web services framework in Python
Now, you might wonder: what is a web service? A web service is simply an API that is hosted on a server and can be consumed over the web. The terms Web API and Web Service are generally used interchangeably.
Flask is a web service development framework in Python. It is not the only one in Python; there are a couple of others as well, such as Django, Falcon, Hug, etc., but you will use Flask for this tutorial. To learn more about Flask, you can refer to these tutorials.
If you downloaded the Anaconda distribution, you already have Flask installed. Otherwise, you will have to install it yourself with:
pip install flask
Flask is very minimal, and it is a favorite with Python developers for many reasons. The framework comes with a built-in, lightweight web server that needs minimal configuration and can be controlled from your Python code. This is one of the reasons it is so popular.
The following code demonstrates Flask's minimality nicely. It creates a simple web API which, upon receiving a request at a particular URL, produces a specific output.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Welcome to machine learning model APIs!"

if __name__ == '__main__':
    app.run(debug=True)
Once you run this, navigate in a web browser to the address shown on the terminal and observe the result.
Some points:
Jupyter Notebooks are great for anything related to Markdown, R, and Python, but when it comes to building a web server, they may show inconsistent behavior. So, it is a good idea to write the Flask code in a text editor like Sublime and run it from the terminal/command prompt.
Make sure you don't name the file flask.py.
Flask runs on port 5000 by default. Sometimes the Flask server starts on this port successfully, but when you hit the URL (that the server returns on the terminal) in a web browser or an API client like Postman, you may not get any output. Consider the following situation:
According to Flask, the server started successfully on port 5000, but when the URL was opened in the browser, it didn't return anything. This can be a case of a port number conflict. Here, changing the default port 5000 to a port of your choice would be a good choice. You can do that as follows: app.run(debug=True, port=12345)
In that case, the Flask server would look something like the following:
Now, let's go through the code you wrote step by step:
You created an instance of the Flask class and passed in the __name__ variable (which is filled in by Python itself). This variable will be "__main__" if the file is run directly through Python as a script. If you imported the file instead, the value of __name__ would be the name of the imported module. For example, if you had test.py and run.py, and you imported test.py into run.py, the __name__ value inside test.py would be "test".
Above the hello() function definition there is @app.route("/"). route() is a decorator that tells Flask which URL should trigger the hello() function.
The hello() function is responsible for producing the output ("Welcome to machine learning model APIs!") whenever your API is hit (or consumed). In this case, visiting localhost:5000/ in a web browser will produce the intended output (provided the Flask server is running on port 5000).
You will now study some of the factors that you will need to keep in mind if you are turning your machine learning models (built using scikit-learn) into a Flask API.
Scikit-learn models with Flask
Creating anything from very simple to very complex machine learning models has never been easier in Python than with scikit-learn. But there are some points you will have to remember about scikit-learn:
Scikit-learn is a Python library which provides simple and efficient tools for data mining and data analysis. Scikit-learn has the following major modules:
Clustering
Regression
Classification
Dimensionality Reduction
Model selection
Preprocessing
(Be sure to check DataCamp's Supervised Learning with scikit-learn course which is taught by the core developer of scikit-learn - Andreas Müller)
Scikit-learn supports serialization and deserialization of the models you train, which saves you from having to retrain a model every time. With a serialized copy of your model, you can write a Flask API.
Scikit-learn models require the data to be in numerical format. That is why, if the dataset contains non-numeric categorical features, it is important to convert them into numeric ones first. For this transformation, scikit-learn provides utilities like LabelEncoder, OneHotEncoder, etc., which can be found in the sklearn.preprocessing module.
Scikit-learn models cannot handle missing values implicitly. You need to handle missing values in the dataset yourself before feeding it to your model. For handling missing values, scikit-learn provides dedicated utilities (an imputer in sklearn.preprocessing, or sklearn.impute in newer versions). A short sketch of both steps is shown below.
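As a quick illustration, here is a minimal sketch of how a categorical column could be encoded and a missing numeric value imputed. The toy dataframe and its column names are invented for this example and are not part of the tutorial's Titanic code:

# A minimal, illustrative sketch (not the tutorial's code) of encoding and imputation
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 30.0]
})

# Encode the categorical column as integers
le = LabelEncoder()
toy['Sex_encoded'] = le.fit_transform(toy['Sex'])

# Fill the missing Age value; a plain pandas fillna works for this simple case
# (SimpleImputer from sklearn.impute would be an alternative in recent versions)
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

print(toy)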
Label encoding and handling missing values are important data preprocessing steps that are essential for building a good machine learning model. If you want to learn more about this, be sure to check the preprocessing courses offered by DataCamp.
For this tutorial, you will use the Titanic dataset, which is one of the most popular datasets for many reasons: it contains a good variety of variable types, it has missing values, etc. This DataCamp tutorial covers an excellent analysis of the dataset, and the dataset can be downloaded from here.
This dataset deals with a classification problem of predicting if a passenger would survive or not given some information about him/her.
Note: The terms variables and features are used interchangeably in this tutorial.
To simplify things even further, you will only use four variables: age, sex, embarked, and survived where survived is the class label.
# Import dependencies
import pandas as pd
import numpy as np
# Load the dataset in a dataframe object and include only four features as mentioned
url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
df = pd.read_csv(url)
include = ['Age', 'Sex', 'Embarked', 'Survived']  # Only four features
df_ = df[include]

"Sex" and "Embarked" are categorical features with non-numeric values, which is why they require numeric transformations. The "Age" feature has missing values; these could be imputed with a summary statistic such as the median or mean. Missing values can be quite meaningful, and it is worth investigating what they represent in real-world applications.
Cells that do not contain anything are read as NaNs. Here, you will simply replace the NaNs with 0, using a small helper loop.
categoricals = []
for col, col_type in df_.dtypes.iteritems():
    if col_type == 'O':
        categoricals.append(col)
    else:
        df_[col].fillna(0, inplace=True)
The above lines of code do the following:
Iterate over all the columns in the dataframe df_ and append the non-numeric (object-typed) columns to the list categoricals.
If a column holds numeric values (which is only Age in this case), fill its missing values with 0. Filling NaNs with a single value may have unintended consequences, especially if that value is within the observed range of the numeric variable. Since zero is not an observed and legitimate age value, you are not introducing the bias you would have introduced had you used, say, 36! - Source
Now that you have handled the missing values and identified the non-numeric columns, you are ready to convert the latter into numeric ones. You will do this using One Hot Encoding (OHE). Pandas provides a simple method, get_dummies(), for creating OHE variables for a given dataframe.
df_ohe = pd.get_dummies(df_, columns=categoricals, dummy_na=True)
When you use OHE, a new column is created for every column/value combination, in a column_value format. For example, for the "Embarked" variable, OHE will produce "Embarked_C", "Embarked_Q", "Embarked_S", and "Embarked_nan". You can verify this by inspecting the columns of df_ohe, as shown below.
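This optional check simply lists the dummy columns generated in the previous step; it assumes the df_ohe dataframe from above is in memory:

# Optional check: list the one-hot encoded columns produced by get_dummies()
ohe_columns = [col for col in df_ohe.columns
               if col.startswith('Sex_') or col.startswith('Embarked_')]
print(ohe_columns)
# Expected to include columns such as 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_nan'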
Now that you’ve successfully preprocessed your dataset, you’re ready to train the machine learning model. You will use a Logistic Regression classifier for this.
from sklearn.linear_model import LogisticRegression

dependent_variable = 'Survived'
x = df_ohe[df_ohe.columns.difference([dependent_variable])]
y = df_ohe[dependent_variable]
lr = LogisticRegression()
lr.fit(x, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
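The tutorial itself does not evaluate the model, but if you are curious how well this simple classifier does, a quick, optional cross-validation check could look like the following sketch (the exact score will depend on your scikit-learn version and the preprocessing above):

# Optional: rough sanity check of model quality via 5-fold cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), x, y, cv=5)
print('Mean CV accuracy: %.3f' % scores.mean())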
You have built your machine learning model. You will now save this model. Technically speaking, you will serialize this model. In Python, you call this Pickling.
Saving the model: Serialization and Deserialization
You will use sklearn's joblib for this. (Note that sklearn.externals.joblib has been removed in newer scikit-learn versions; in that case, install and import the standalone joblib package directly.)
from sklearn.externals import joblib
joblib.dump(lr, 'model.pkl')
['model.pkl']
The Logistic Regression model is now persisted. You can load this model into memory with a single line of code. Loading the model back into your workspace is known as Deserialization.
lr = joblib.load('model.pkl')
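As an optional sanity check, you can confirm that the deserialized model still produces predictions; this small snippet assumes x from the training step is still in memory:

# Optional: make sure the deserialized model can still produce predictions
sample = x.iloc[:5]   # first five training rows
print(lr.predict(sample))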
You’re now ready to use Flask to serve your persisted model. You have already seen how minimalistic Flask is to get started with.
Creating an API from a machine learning model using Flask
For serving your model with Flask, you will do the following two things:
Load the already persisted model into memory when the application starts,
Create an API endpoint that takes input variables, transforms them into the appropriate format, and returns predictions.
More specifically, your sample input to the API will look like the following:
[
    {"Age": 85, "Sex": "male", "Embarked": "S"},
    {"Age": 24, "Sex": "female", "Embarked": "C"},
    {"Age": 3, "Sex": "male", "Embarked": "C"},
    {"Age": 21, "Sex": "male", "Embarked": "S"}
]
(which is a JSON list of inputs)
and your API will output like the following:
{"prediction": [0, 1, 1, 0]}
The predictions denote the survival statuses where 0 represents No and 1 represents Yes.
JSON stands for JavaScript Object Notation, and it is one of the most widely used data interchange formats. If you need a quick introduction to it, please follow these tutorials.
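For example, Python's built-in json module converts between JSON strings and Python objects; this tiny, illustrative snippet mirrors the request body shown above:

# Illustration only: parsing a JSON request body into Python objects
import json

body = '[{"Age": 85, "Sex": "male", "Embarked": "S"}]'
records = json.loads(body)   # a list of dicts
print(records[0]['Age'])     # 85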
Let's write a function predict() which will:
Load the persisted model into memory when the application starts,
Create an API endpoint that takes input variables, transforms them into the appropriate format, and returns predictions.
You have already seen how to load a persisted model. Now, you will focus on how you can use it for predicting the survival status upon receiving inputs.
from flask import Flask, request, jsonify
import pandas as pd

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    json_ = request.json
    query_df = pd.DataFrame(json_)
    query = pd.get_dummies(query_df)
    prediction = lr.predict(query)
    return jsonify({'prediction': list(prediction)})
Fantastic! But you have got a little problem here.
The function you wrote would only work if the incoming request contains all possible values of the categorical variables, which may or may not be the case in practice. If the incoming request does not include all possible values of the categorical variables, then, with the current definition of predict(), get_dummies() would generate a dataframe that has fewer columns than the classifier expects, which would result in a runtime error. The snippet below illustrates the problem.
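This small, illustrative comparison (not part of the final API) shows the dummy columns produced for the full training data versus a single incoming request; it assumes df_ohe from the training step is in memory:

# Illustration of the problem: a single request produces far fewer dummy columns
single_request = pd.DataFrame([{"Age": 85, "Sex": "male", "Embarked": "S"}])
request_ohe = pd.get_dummies(single_request)

print(len(df_ohe.columns.difference(['Survived'])))  # number of columns the model expects
print(list(request_ohe.columns))                     # only Age, Sex_male, Embarked_S appear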
To solve this problem, you will persist the list of columns during model training as well. You can serialize any Python object into a .pkl file. You will use joblib in the same way as previously.
(Keep in mind that, as discussed earlier, it is always better to do all the server-level coding in a text editor and then run it from a terminal.)
model_columns = list(x.columns)
joblib.dump(model_columns, 'model_columns.pkl')
['model_columns.pkl']
As you have already persisted the list of columns, you can simply handle the missing columns at prediction time. You will have to load the model columns when the application starts.
@app.route('/predict', methods=['POST'])  # Your API endpoint URL would contain /predict
def predict():
    if lr:
        try:
            json_ = request.json
            query = pd.get_dummies(pd.DataFrame(json_))
            query = query.reindex(columns=model_columns, fill_value=0)
            prediction = list(lr.predict(query))
            return jsonify({'prediction': prediction})
        except:
            return jsonify({'trace': traceback.format_exc()})
    else:
        print('Train the model first')
        return ('No model here to use')
You included all the required elements in the "/predict" API, and now you just need to write the main section that starts the server.
if __name__ == '__main__':
    try:
        port = int(sys.argv[1])  # This is for a command-line argument
    except:
        port = 12345  # If you don't provide any port then the port will be set to 12345

    lr = joblib.load(model_file_name)  # Load "model.pkl"
    print('Model loaded')
    model_columns = joblib.load(model_columns_file_name)  # Load "model_columns.pkl"
    print('Model columns loaded')

    app.run(port=port, debug=True)
Your API is now ready to be hosted. But before going any further, let's recap everything you have done up to this point:
Putting it all together:
You loaded Titanic dataset and selected the four features.
You did the necessary data preprocessing.
You built a Logistic Regression classifier and serialized it.
You also serialized the list of training columns, so that requests producing fewer than the expected number of dummy columns can be handled at prediction time.
You then wrote a simple API using Flask that predicts whether a person survived the shipwreck given their age, sex, and embarked information.
Let's put all the code in one place so that you don't miss anything. It is also good programming practice to separate your Logistic Regression model code and your Flask API code into separate .py files.
So your model.py should look like the following:
# Import dependencies
import pandas as pd
import numpy as np

# Load the dataset in a dataframe object and include only four features as mentioned
url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
df = pd.read_csv(url)
include = ['Age', 'Sex', 'Embarked', 'Survived']  # Only four features
df_ = df[include]

# Data Preprocessing
categoricals = []
for col, col_type in df_.dtypes.iteritems():
    if col_type == 'O':
        categoricals.append(col)
    else:
        df_[col].fillna(0, inplace=True)

df_ohe = pd.get_dummies(df_, columns=categoricals, dummy_na=True)

# Logistic Regression classifier
from sklearn.linear_model import LogisticRegression

dependent_variable = 'Survived'
x = df_ohe[df_ohe.columns.difference([dependent_variable])]
y = df_ohe[dependent_variable]
lr = LogisticRegression()
lr.fit(x, y)

# Save your model
from sklearn.externals import joblib
joblib.dump(lr, 'model.pkl')
print("Model dumped!")

# Load the model that you just saved
lr = joblib.load('model.pkl')

# Saving the data columns from training
model_columns = list(x.columns)
joblib.dump(model_columns, 'model_columns.pkl')
print("Models columns dumped!")
Your api.py should look like the following:
# Dependencies
from flask import Flask, request, jsonify
from sklearn.externals import joblib
import traceback
import pandas as pd
import numpy as np
import sys

# Your API definition
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    if lr:
        try:
            json_ = request.json
            print(json_)
            query = pd.get_dummies(pd.DataFrame(json_))
            query = query.reindex(columns=model_columns, fill_value=0)
            prediction = list(lr.predict(query))
            return jsonify({'prediction': str(prediction)})
        except:
            return jsonify({'trace': traceback.format_exc()})
    else:
        print('Train the model first')
        return ('No model here to use')

if __name__ == '__main__':
    try:
        port = int(sys.argv[1])  # This is for a command-line input
    except:
        port = 12345  # If you don't provide any port the port will be set to 12345

    lr = joblib.load("model.pkl")  # Load "model.pkl"
    print('Model loaded')
    model_columns = joblib.load("model_columns.pkl")  # Load "model_columns.pkl"
    print('Model columns loaded')

    app.run(port=port, debug=True)
Pretty neat! Now you will test this API in an API client called Postman. Just make sure that model.py and api.py are in the same directory and that you have run both of them before testing. Refer to the following snapshot of the terminal, taken after both .py files were run successfully.
If both files ran successfully, your directory structure should look like the following:
Note: The IPYNB file is optional though.
Testing your API in Postman
In order to test your API, you will need some kind of API client. Postman is undoubtedly one of the best ones out there, and you can easily download it from its website.
The Postman interface looks like the following if you downloaded the latest one:
After you have started the Flask server successfully, you then need to enter the right URL with the correct port number in Postman. It should look similar to the following:
Congratulations! You just built your first ever machine learning API.
Your API can predict whether a passenger survived the Titanic shipwreck given their age, sex, and embarked information. Now, your friend can call it from their front-end code and process the output of the API into something fascinating. If you prefer to test the endpoint from code rather than Postman, see the sketch below.
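A quick sketch using Python's requests library could look like the following; it assumes the Flask server is running locally on the port configured above (12345) and that the requests package is installed:

# Quick test of the /predict endpoint without Postman
import requests

sample = [
    {"Age": 85, "Sex": "male", "Embarked": "S"},
    {"Age": 24, "Sex": "female", "Embarked": "C"}
]
response = requests.post("http://localhost:12345/predict", json=sample)
print(response.json())   # e.g. {'prediction': '[0, 1]'}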
Taking it further:
In this tutorial, you covered one of the most in-demand industry skills of a full-stack Data Scientist: building an API from a machine learning model. Although this API is straightforward, it is always better to start with the simplest version so that you understand the details of how it works.
You can do a lot more in order to improve this. Possible options you might want to consider:
Write a "/train" API which would train a Logistic Regression classifier with the data (a rough sketch of such an endpoint is given after this list).
Code a Neural Network model using keras and build an API out of it.
Host your API on Cloud so that it can be consumed.
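For the first suggestion, here is a rough, hypothetical sketch of what a "/train" endpoint might look like if added to api.py; the endpoint name, the extra import, and the globals it updates are assumptions for illustration, not part of the tutorial's code:

# Hypothetical sketch of a "/train" endpoint (not in the tutorial's api.py)
from sklearn.linear_model import LogisticRegression  # additional import needed in api.py

@app.route('/train', methods=['GET'])
def train():
    global lr, model_columns
    df = pd.read_csv("http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv")
    df_ = df[['Age', 'Sex', 'Embarked', 'Survived']]
    df_ohe = pd.get_dummies(df_.fillna({'Age': 0}), columns=['Sex', 'Embarked'], dummy_na=True)

    x = df_ohe[df_ohe.columns.difference(['Survived'])]
    y = df_ohe['Survived']
    lr = LogisticRegression()
    lr.fit(x, y)

    # Persist the retrained model and its columns, just like model.py does
    joblib.dump(lr, 'model.pkl')
    model_columns = list(x.columns)
    joblib.dump(model_columns, 'model_columns.pkl')
    return 'Model trained and saved'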
To take things to a more advanced level, you might refer to this Machine Learning Mastery blog, which discusses several industry-grade approaches.
The possibilities and opportunities are enormous here. You just need to carefully select the ones which are the most suitable for you.
If you would like to learn more about Machine Learning in Python, take DataCamp's Preprocessing for Machine Learning in Python course.
Different types of data
Business analysts and data scientists come across many different types of data in their analytics projects. Most data commonly found in academic and industrial projects can be broadly classified into the following categories:
Cross-sectional data
Time series data
Panel data
Understanding what type of data is needed to solve a problem and what type of data can be obtained from available sources is important for formulating the problem and choosing the right methodology for analysis.
Cross-sectional data

Cross-sectional data, or a cross-section of a population, is obtained by taking observations from multiple individuals at the same point in time. Cross-sectional data can comprise observations taken at different points in time; however, in such cases time itself does not play any significant role in the analysis. SAT scores of high school students in a particular year are an example of cross-sectional data. Gross domestic product of countries in a given year is another example. Data for customer churn analysis is yet another example. Note that, in the case of SAT scores of students and GDP of countries, all the observations have been taken in a single year, and this makes the two datasets cross-sectional. In essence, cross-sectional data represents a snapshot at a given instant of time in both cases. However, customer data for churn analysis can be obtained over a span of time such as years and months. But for the purpose of analysis, time might not play an important role, and therefore, though customer churn data might be sourced from multiple points in time, it may still be considered a cross-sectional dataset.

Often, analysis of cross-sectional data starts with a plot of the variables to visualize their statistical properties such as central tendency, dispersion, skewness, and kurtosis. The following figure illustrates this with the univariate example of military expenditure as a percentage of Gross Domestic Product of 85 countries in the year 2010. By taking the data from a single year we ensure its cross-sectional nature. The figure combines a normalized histogram and a kernel density plot in order to highlight different statistical properties of the military expense data. As evident from the plot, military expenditure is slightly left skewed with a major peak roughly around 1.0%. A couple of minor peaks can also be observed near 6.0% and 8.0%.

Figure 1.1: Example of univariate cross-sectional data

Exploratory data analysis such as the one in the preceding figure can be done for multiple variables as well, in order to understand their joint distribution. Let us illustrate a bivariate analysis by considering the total debt of the countries' central governments along with their military expenditure in 2010. The following figure shows the joint distribution of these variables as kernel density plots. The bivariate joint distribution shows no clear correlation between the two, except maybe for lower values of military expenditure and debt of central government.

Figure 1.2: Example of bi-variate cross-sectional data

Note: It is noteworthy that analysis of cross-sectional data extends beyond exploratory data analysis and visualization as shown in the preceding example. Advanced methods such as cross-sectional regression fit a linear regression model between several explanatory variables and a dependent variable. For example, in the case of customer churn analysis, the objective could be to fit a logistic regression model between customer attributes and customer behavior described by churned or not-churned. The logistic regression model is a special case of generalized linear regression for discrete and binary outcomes. It explains the factors that make customers churn and can predict the outcome for a new customer. Since time is not a crucial element in this type of cross-sectional data, predictions can be obtained for a new customer at a future point in time.
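As a small, self-contained illustration of the kind of cross-sectional regression described in the preceding note, the following sketch fits a logistic regression to a made-up churn-style dataset; the feature names and values are invented purely for illustration:

# Illustrative only: a cross-sectional (logistic) regression on made-up churn data
import pandas as pd
from sklearn.linear_model import LogisticRegression

churn = pd.DataFrame({
    'tenure_months':   [1, 24, 6, 36, 3, 48, 12, 60],
    'monthly_charges': [70, 40, 85, 35, 90, 30, 65, 25],
    'churned':         [1, 0, 1, 0, 1, 0, 1, 0]
})

model = LogisticRegression()
model.fit(churn[['tenure_months', 'monthly_charges']], churn['churned'])

# Predict the outcome for a new customer (time plays no role here)
new_customer = pd.DataFrame([[10, 75]], columns=['tenure_months', 'monthly_charges'])
print(model.predict(new_customer))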
In this book, we discuss techniques for modeling time series data, in which time and the sequential nature of observations are crucial factors for analysis.

The dataset for the example on military expenditures and national debt of countries has been downloaded from the Open Data Catalog of the World Bank. You can find the data in the WDIData.csv file under the datasets folder of this book's GitHub repository.

All examples in this book are accompanied by an implementation in Python. So let us now discuss the Python program written to generate the preceding figures. Before we are able to plot the figures, we must read the dataset into Python and familiarize ourselves with the basic structure of the data in terms of the columns and rows found in the dataset. Datasets used for the examples and figures in this book are in Excel or CSV format. We will use the pandas package to read and manipulate the data. For visualization, matplotlib and seaborn are used. Let us start by importing all the packages to run this example:

from __future__ import print_function
import os
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

The print_function has been imported from the __future__ package to enable using print as a function for readers who might be using a 2.x version of Python. In Python 3.x, print is a function by default. As this code is written and executed from an IPython notebook, %matplotlib inline ensures that the graphics packages are imported properly and made to work in the HTML environment of the notebook. The os package is used to set the working directory as follows:

os.chdir('D:\Practical Time Series')

Now, we read the data from the CSV file and display basic information about it:

data = pd.read_csv('datasets/WDIData.csv')
print('Column names:', data.columns)

This gives us the following output showing the column names of the dataset:

Column names: Index([u'Country Name', u'Country Code', u'Indicator Name', u'Indicator Code',
       u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
       u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
       u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
       u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
       u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
       u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
       u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
       u'2016'],
      dtype='object')

Let us also get a sense of the size of the data in terms of the number of rows and columns by running the following line:

print('No. of rows, columns:', data.shape)

This returns the following output:

No. of rows, columns: (397056, 62)

This dataset has nearly 400k rows because it captures 1504 world development indicators for 264 different countries. The unique number of countries can be obtained by running the following lines:

nb_countries = data['Country Code'].unique().shape[0]
print('Unique number of countries:', nb_countries)

As it appears from the structure of the data, every row gives the observations about an indicator that is identified by the columns Indicator Name and Indicator Code, for the country indicated by the columns Country Name and Country Code. Columns 1960 through 2016 hold the values of an indicator over that period of time.
With this understanding of how the data is laid out in the DataFrame, we are now set to extract the rows and columns that are relevant for our visualization. Let us start by preparing two other DataFrames that get the rows corresponding to the indicators Total Central Government Debt (as % of GDP) and Military expenditure (% of GDP) for all the countries. This is done by slicing the original DataFrame as follows:

central_govt_debt = data.ix[data['Indicator Name']=='Central government debt, total (% of GDP)']
military_exp = data.ix[data['Indicator Name']=='Military expenditure (% of GDP)']

The preceding two lines create two new DataFrames, namely central_govt_debt and military_exp. A quick check of the shapes of these DataFrames can be done by running the following two lines:

print('Shape of central_govt_debt:', central_govt_debt.shape)
print('Shape of military_exp:', military_exp.shape)

These lines return the following output:

Shape of central_govt_debt: (264, 62)
Shape of military_exp: (264, 62)

These DataFrames have all the information we need. In order to plot the univariate and bivariate cross-sectional data in the preceding figure, we need the column 2010. Before we actually run the code for plotting, let us quickly check if the column 2010 has missing values. This is done by the following two lines:

central_govt_debt['2010'].describe()
military_exp['2010'].describe()

Which generate the following outputs, respectively:

count     93.000000
mean      52.894412
std       30.866372
min        0.519274
25%             NaN
50%             NaN
75%             NaN
max      168.474953
Name: 2010, dtype: float64

count    194.000000
mean       1.958123
std        1.370594
min        0.000000
25%             NaN
50%             NaN
75%             NaN
max        8.588373
Name: 2010, dtype: float64

This tells us that the describe function could not compute the 25th, 50th, and 75th quartiles for either column; hence there are missing values to be avoided. Additionally, we would like the Country Code column to be the row index. So the following couple of lines are executed:

central_govt_debt.index = central_govt_debt['Country Code']
military_exp.index = military_exp['Country Code']

Next, we create two pandas.Series by taking the non-empty 2010 columns from central_govt_debt and military_exp. The newly created Series objects are then merged to form a single DataFrame:

central_govt_debt_2010 = central_govt_debt['2010'].ix[~pd.isnull(central_govt_debt['2010'])]
military_exp_2010 = military_exp['2010'].ix[~pd.isnull(military_exp['2010'])]
data_to_plot = pd.concat((central_govt_debt_2010, military_exp_2010), axis=1)
data_to_plot.columns = ['central_govt_debt', 'military_exp']
data_to_plot.head()

The preceding lines return a table (rows AFG, AGO, ALB, ARB, ARE, ARG, ARM, ATG, AUS, AUT with the columns central_govt_debt and military_exp, several entries of which are NaN) that shows that not all countries have information on both Central Government Debt and Military Expense for the year 2010. To plot, we have to take only those countries that have both central government debt and military expense.
Run the following line to filter out rows with missing values:

data_to_plot = data_to_plot.ix[(~pd.isnull(data_to_plot.central_govt_debt)) & (~pd.isnull(data_to_plot.military_exp)), :]

The first five rows of the filtered DataFrame are displayed by running the following line:

data_to_plot.head()

(table: data_to_plot.head() showing rows AUS, AUT, AZE, BEL, BGR, each with non-empty values for central_govt_debt and military_exp)

The preceding table has only non-empty values, and we are now ready to generate the plots for the cross-sectional data. The following lines of code generate the plot of the univariate cross-sectional data on military expense:

plt.figure(figsize=(5.5, 5.5))
g = sns.distplot(np.array(data_to_plot.military_exp), norm_hist=False)
g.set_title('Military expenditure (% of GDP) of 85 countries in 2010')

The plot is saved as a png file under the plots/ch1 folder of this book's GitHub repository. We will also generate the bivariate plot between military expense and central government debt by running the following code:

plt.figure(figsize=(5.5, 5.5))
g = sns.kdeplot(data_to_plot.military_exp, data2=data_to_plot.central_govt_debt)
g.set_title('Military expenditures & Debt of central governments in 2010')
Time series data

The example of cross-sectional data discussed earlier is from the year 2010 only. However, if we instead consider only one country, for example the United States, and look at its military expenses and central government debt over a span of 10 years from 2001 to 2010, we get two time series: one about the US federal military expenditure and the other about the debt of the US federal government. Therefore, in essence, a time series is made up of quantitative observations on one or more measurable characteristics of an individual entity, taken at multiple points in time. In this case, the data represents yearly military expenditure and government debt for the United States.

Time series data is typically characterized by several interesting internal structures such as trend, seasonality, stationarity, autocorrelation, and so on. These will be conceptually discussed in the coming sections of this chapter. The internal structures of time series data require special formulation and techniques for their analysis. These techniques will be covered in the following chapters with case studies and implementations of working code in Python.

The following figure plots the couple of time series we have been talking about:

Figure 1.3: Examples of time series data

In order to generate the preceding plots, we will extend the code that was developed for the graphs of the cross-sectional data. We will start by creating two new Series to represent the time series of military expenses and central government debt of the United States from 1960 to 2010:

central_govt_debt_us = central_govt_debt.ix[central_govt_debt['Country Code']=='USA', :].T
military_exp_us = military_exp.ix[military_exp['Country Code']=='USA', :].T

The two Series objects created in the preceding code are merged to form a single DataFrame and sliced to hold data for the years 1960 through 2010:

data_us = pd.concat((military_exp_us, central_govt_debt_us), axis=1)
index0 = np.where(data_us.index=='1960')[0][0]
index1 = np.where(data_us.index=='2010')[0][0]
data_us = data_us.iloc[index0:index1+1,:]
data_us.columns = ['Federal Military Expenditure', 'Debt of Federal Government']
data_us.head(10)

The data prepared by the preceding code returns the following table:

(table: data_us.head(10) — the years 1960 through 1969 have NaN for both Federal Military Expenditure and Debt of Federal Government)

The preceding table shows that data on federal military expenses and federal debt is not available for several years starting from 1960.
Hence, we drop the rows with missing values from the DataFrame data_us before plotting the time series:

data_us.dropna(inplace=True)
print('Shape of data_us:', data_us.shape)

As seen in the output of the print function, the DataFrame has twenty-three rows after dropping the missing values:

Shape of data_us: (23, 2)

After dropping the rows with missing values, the first ten rows of data_us are displayed as follows:

(table: the first ten rows of data_us, starting in 1988, showing Federal Military Expenditure and Debt of Federal Government; the exact numeric values are omitted here)

Finally, the time series are generated by executing the following code:

# Two subplots, the axes array is 1-d
f, axarr = plt.subplots(2, sharex=True)
f.set_size_inches(5.5, 5.5)
axarr[0].set_title('Federal Military Expenditure during 1988-2010 (% of GDP)')
data_us['Federal Military Expenditure'].plot(linestyle='-', marker='*', color='b', ax=axarr[0])
axarr[1].set_title('Debt of Federal Government during 1988-2010 (% of GDP)')
data_us['Debt of Federal Government'].plot(linestyle='-', marker='*', color='r', ax=axarr[1])
Panel data

So far, we have seen data taken from multiple individuals at one point in time (cross-sectional) or data taken from an individual entity at multiple points in time (time series). However, if we observe multiple entities over multiple points in time, we get panel data, also known as longitudinal data. Extending our earlier example about military expenditure, let us now consider four countries over the same period of 1960-2010. The resulting dataset will be a panel dataset. The figure given below illustrates the panel data in this scenario. Rows with missing values, corresponding to the period 1960 to 1987, have been dropped before plotting the data.

Figure 1.4: Example of panel data

Note: A generic panel data regression model can be stated as y_it = W x_it + b + ε_it, which expresses the dependent variable y_it as a linear model of the explanatory variables x_it, where W are the weights of x_it, b is the bias term, and ε_it is the error. Here, i represents the individuals for whom data is collected at multiple points in time, represented by t. As evident, this type of panel data analysis seeks to model the variations across both multiple individuals and multiple points in time. The variations are reflected by ε_it, and the assumptions about it determine the necessary mathematical treatment. For example, if ε_it is assumed to vary non-stochastically with respect to i and t, then it reduces to a dummy variable representing random noise. This type of analysis is known as a fixed effects model. On the other hand, if ε_it varies stochastically over i and t, the error requires special treatment and is dealt with in a random effects model.

Let us prepare the data required to plot the preceding figure. We will continue to expand the code we have used for the cross-sectional and time series data previously in this chapter. We start by creating a DataFrame having the data for the four countries mentioned in the preceding plot. This is done as follows:

chn = data.ix[(data['Indicator Name']=='Military expenditure (% of GDP)')&\
              (data['Country Code']=='CHN'),index0:index1+1]
chn = pd.Series(data=chn.values[0], index=chn.columns)
chn.dropna(inplace=True)

usa = data.ix[(data['Indicator Name']=='Military expenditure (% of GDP)')&\
              (data['Country Code']=='USA'),index0:index1+1]
usa = pd.Series(data=usa.values[0], index=usa.columns)
usa.dropna(inplace=True)

ind = data.ix[(data['Indicator Name']=='Military expenditure (% of GDP)')&\
              (data['Country Code']=='IND'),index0:index1+1]
ind = pd.Series(data=ind.values[0], index=ind.columns)
ind.dropna(inplace=True)

gbr = data.ix[(data['Indicator Name']=='Military expenditure (% of GDP)')&\
              (data['Country Code']=='GBR'),index0:index1+1]
gbr = pd.Series(data=gbr.values[0], index=gbr.columns)
gbr.dropna(inplace=True)

Now that the data is ready for all four countries, we will plot them using the following code:

plt.figure(figsize=(5.5, 5.5))
usa.plot(linestyle='-', marker='*', color='b')
chn.plot(linestyle='-', marker='*', color='r')
gbr.plot(linestyle='-', marker='*', color='g')
ind.plot(linestyle='-', marker='*', color='y')
plt.legend(['USA','CHINA','UK','INDIA'], loc=1)
plt.title('Military expenditure of 4 countries over the years')
plt.ylabel('Military expenditure (% of GDP)')
plt.xlabel('Years')
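To make the notion of panel data concrete in code, here is a small, illustrative sketch that arranges made-up military-expenditure-style values for multiple countries and years into a pandas DataFrame indexed by entity and time; the numbers are invented for illustration:

# Illustrative only: a tiny panel dataset indexed by (country, year)
import pandas as pd

panel = pd.DataFrame(
    {'military_exp_pct_gdp': [3.1, 3.0, 2.9, 1.9, 2.0, 2.1]},
    index=pd.MultiIndex.from_product(
        [['USA', 'CHN'], [2008, 2009, 2010]],
        names=['country', 'year']
    )
)

print(panel)               # each (country, year) pair is one observation
print(panel.loc['USA'])    # the time series for a single entity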