<p><em>Modern Reference Architecture Deployed On AWS (2020-06-19), by Solomon Amos</em></p>

<h1 id="-reference-architecture-infrastructure-overview">Reference Architecture Infrastructure Overview</h1>
<p>This reference architecture deployed on top of <a href="https://aws.amazon.com">Amazon Web Services (AWS)</a> is an end-to-end tech stack built using Infrastructure as Code (IaC).
It is based on best practices described in the <a href="https://d1.awsstatic.com/whitepapers/aws-web-hosting-best-practices.pdf">Web Application Hosting in the AWS Cloud</a> white paper and on customer requirements, and it runs in a highly available and scalable mode.
<img src="/assets/images/reference-architecture.png" alt="Infrastructure" class="align-center" /></p>
<p>The Reference Architecture is highly customisable, so what’s deployed may be a bit different from what is in the diagram. Here is an overview of what is actually deployed:</p>
<ol>
<li><a href="#infrastructure-as-code">Infrastructure as code</a></li>
<li><a href="#environments">Environments</a></li>
<li><a href="#aws-accounts">AWS accounts</a></li>
<li><a href="#vpcs-and-subnets">VPCs and subnets</a></li>
<li><a href="#load-balancers">Load balancers</a></li>
<li><a href="#docker-clusters">Docker clusters (ECS)</a></li>
<li><a href="#data-stores">Data stores</a></li>
<li><a href="#openvpn-server">OpenVPN server</a></li>
<li><a href="#circleci">CircleCI</a></li>
<li><a href="#monitoring-log-aggregation-alerting">Monitoring, log aggregation, alerting</a></li>
<li><a href="#dns-and-tls">DNS and TLS</a></li>
<li><a href="#static-content-s3-and-cloudfront">Static content, S3, and CloudFront</a></li>
<li><a href="#lambda">Lambda</a></li>
<li><a href="#security">Security</a></li>
</ol>
<h2 id="infrastructure-as-code">Infrastructure as code</h2>
<p>The infrastructure is managed as <strong>code</strong>, primarily using <a href="https://www.terraform.io/">Terraform</a>.
That is, instead of clicking around a web UI or SSHing to a server and manually executing commands, the idea behind
infrastructure as code (IaC) is that you write code to define your infrastructure and you let an automated tool (e.g.,
Terraform) apply the code changes to your infrastructure. This has a number of benefits:</p>
<ul>
<li>
<p>You can automate your entire provisioning and deployment process, which makes it much faster and more reliable than
any manual process.</p>
</li>
<li>
<p>You can represent the state of your infrastructure in source files that anyone can read rather than a sysadmin’s head.</p>
</li>
<li>
<p>You can store those source files in version control, which means the entire history of your infrastructure is
captured in the commit log, which you can use to debug problems, and if necessary, roll back to older versions.</p>
</li>
<li>
<p>You can validate each infrastructure change through code reviews and automated tests.</p>
</li>
<li>
<p>You can package your infrastructure as reusable, documented, battle-tested modules that make it easier to scale and
evolve your infrastructure.</p>
</li>
</ul>
<h2 id="environments">Environments</h2>
<p>The infrastructure is deployed across multiple environments:</p>
<ul>
<li>
<p><strong>dev</strong> (account id): Sandbox environment.</p>
</li>
<li>
<p><strong>prod</strong> (account id): Production environment.</p>
</li>
<li>
<p><strong>security</strong> (account id): All IAM users and permissions are defined in this account.</p>
</li>
<li>
<p><strong>shared-services</strong> (account id): DevOps tooling, such as the OpenVPN server.</p>
</li>
<li>
<p><strong>stage</strong> (account id): Pre-production environment.</p>
</li>
</ul>
<h2 id="aws-accounts">AWS accounts</h2>
<p>The infrastructure is deployed across multiple AWS accounts. For example, the development environment is in one account,
the production environment in another account, the DevOps tooling in yet another account, and so on. This gives you
better isolation between environments so that if you break something in one environment (e.g., staging)—or worse yet, a
hacker breaks into that environment—it should have no effect on your other environments (e.g., prod). It also gives you
better control over what resources each employee can access. This concept is known as defense in depth.</p>
<h2 id="vpcs-and-subnets">VPCs and subnets</h2>
<p>Each environment lives in a separate <a href="https://aws.amazon.com/vpc/">Virtual Private Cloud (VPC)</a>, which is a logically
isolated section within an AWS account. Each VPC defines a virtual network, with its own IP address space and rules for
what can go in and out of that network. The IP addresses within each VPC are further divided into multiple
<a href="http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html">subnets</a>, where each subnet controls the
routing for its IP addresses.</p>
<ul>
<li><em>Public subnets</em> are directly accessible from the public Internet.</li>
<li><em>Private subnets</em> are only accessible from within the VPC.</li>
</ul>
<p>Just about everything in this infrastructure is deployed in private subnets to reduce the surface area to attackers.
The only exceptions are load balancers and the OpenVPN server, both of which are described below.</p>
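<p>As a sketch of how a VPC's address range can be carved into subnets (the CIDR blocks below are illustrative, not the ones actually deployed), Python's standard <code>ipaddress</code> module can do the arithmetic:</p>

```python
import ipaddress

# Hypothetical VPC CIDR block; real deployments choose their own ranges.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Carve the /16 into /20 subnets, then split them between tiers
# (e.g. one subnet per availability zone).
subnets = list(vpc.subnets(new_prefix=20))
public_subnets = subnets[:3]
private_subnets = subnets[3:6]

for net in public_subnets:
    print("public ", net)
for net in private_subnets:
    print("private", net)
```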
<h2 id="load-balancers">Load balancers</h2>
<p>Traffic from the public Internet (e.g., requests from your users) initially goes to a <em>public load balancer</em>, which
proxies the traffic to the application APIs. This allows you to run multiple copies of the application for scalability and high
availability. The load balancers being used are:</p>
<ul>
<li><a href="https://aws.amazon.com/elasticloadbalancing/applicationloadbalancer/">Application Load Balancer (ALB)</a>: The ALB is a
load balancer managed by AWS that is designed for routing HTTP and HTTPS traffic. The advantage of using a managed
service is that AWS takes care of fault tolerance, security, and scaling the load balancer for you automatically.</li>
</ul>
<p>We also deploy an <em>internal</em> load balancer in the private subnets. This load balancer is not accessible to the public.
Instead, it’s used as a simple way to do service discovery: every backend service registers with the load balancer at a
particular path, and all services know to send requests to this load balancer to talk to other services.</p>
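<p>As a toy sketch of that path-based registration idea (plain Python, not AWS code; the service names and URLs are made up):</p>

```python
# Toy illustration of path-based service discovery: each service registers
# under a path prefix, and callers resolve backends through one well-known
# endpoint. The hostnames here are hypothetical.
registry = {
    "/users": "http://users-service.internal:8080",
    "/orders": "http://orders-service.internal:8080",
}

def resolve(path: str) -> str:
    """Return the backend registered under the longest matching prefix."""
    matches = [prefix for prefix in registry if path.startswith(prefix)]
    if not matches:
        raise KeyError(f"no service registered for {path}")
    return registry[max(matches, key=len)]

print(resolve("/orders/42"))   # http://orders-service.internal:8080
```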
<h2 id="docker-clusters">Docker clusters</h2>
<p>The application code is packaged into <a href="http://docker.com/">Docker containers</a> and deployed across an Amazon
<a href="https://aws.amazon.com/ecs/">EC2 Container Service (ECS)</a> cluster.
The advantage of Docker is that it allows you to package
your code so that it runs exactly the same way in all environments (dev, stage, prod). The advantage of a Docker
Cluster is that it makes it easy to deploy your Docker containers across a cluster of servers, making efficient use of
wherever resources are available. Moreover, ECS can automatically scale your app up and down in response to load and
redeploy containers that crashed.</p>
<p>For a quick intro to Docker, see <a href="http://www.ybrikman.com/writing/2016/03/31/infrastructure-as-code-microservices-aws-docker-terraform-ecs/">Running microservices on AWS using Docker, Terraform, and
ECS</a>.</p>
<h2 id="data-stores">Data stores</h2>
<p>The infrastructure includes the following data stores:</p>
<ol>
<li>
<p><strong>Postgres</strong>: Postgres is deployed using <a href="https://aws.amazon.com/rds/">Amazon’s Relational Database Service
(RDS)</a>, including automatic failover, backups, and replicas.</p>
</li>
<li>
<p><strong>Memcached</strong>: Memcached is deployed using <a href="https://aws.amazon.com/elasticache/">Amazon’s ElastiCache
Service</a>, which manages node provisioning, patching, and automatic detection and replacement of failed nodes.</p>
</li>
</ol>
<h2 id="lambda">Lambda</h2>
<p>We have deployed several example <a href="https://aws.amazon.com/lambda/">Lambda functions</a> to show how you can build
serverless applications.</p>
<h2 id="openvpn-server">OpenVPN server</h2>
<p>To reduce your surface area to attackers, just about all of the resources in this infrastructure run in private subnets,
which are not accessible from the public Internet at all. To allow developers access to these
private resources, we expose a single server publicly: an <a href="https://openvpn.net/">OpenVPN server</a>. Once you connect to
the server using a VPN client, you are “in the network”, and will be able to access the private resources (e.g., you
will be able to SSH to your EC2 Instances).</p>
<h2 id="circleci">CircleCI</h2>
<p>We have set up <a href="https://circleci.com/">CircleCI</a> as a Continuous Integration (CI) server. After every commit, a CircleCI
job runs your build, tests, packaging, and automated deployment steps.</p>
<h2 id="monitoring-log-aggregation-alerting">Monitoring, log aggregation, alerting</h2>
<p>You can view metrics and log files from all your servers, and subscribe to alert notifications, using <a href="https://aws.amazon.com/cloudwatch/">Amazon
CloudWatch</a>.</p>
<h2 id="dns-and-tls">DNS and TLS</h2>
<p><a href="https://aws.amazon.com/route53/">Amazon Route 53</a> is used to configure DNS entries for all your services. We
have configured SSL/TLS certificates for domain names using <a href="https://aws.amazon.com/certificate-manager/">Amazon’s Certificate Manager
(ACM)</a>, which issues certificates that are free and renew automatically.</p>
<h2 id="static-content-s3-and-cloudfront">Static content, S3, and CloudFront</h2>
<p>All static content (e.g., images, CSS, JS) is stored in <a href="https://aws.amazon.com/s3/">Amazon S3</a> and served via the
<a href="https://aws.amazon.com/cloudfront/">CloudFront</a> CDN. This allows you to offload all the work of serving static content
from your app server and reduces latency for your users.</p>
<h2 id="security">Security</h2>
<p>Security best practices are built into every aspect of this infrastructure:</p>
<ul>
<li>
<p><strong>Network security</strong></p>
</li>
<li>
<p><strong>Server access</strong></p>
</li>
<li>
<p><strong>Application secrets</strong></p>
</li>
<li>
<p><strong>User accounts</strong></p>
</li>
<li>
<p><strong>Auditing</strong></p>
</li>
<li>
<p><strong>Intrusion detection</strong></p>
</li>
<li>
<p><strong>Security updates</strong></p>
</li>
<li>
<p><strong>OS hardening</strong></p>
</li>
<li>
<p><strong>End-to-end encryption</strong></p>
</li>
</ul>

<p><em>Notes from the book ‘Deep Work: Rules for Focused Success in a Distracted World’ (2018-12-15)</em></p>

<ul>
<li>Two Core Abilities for Thriving in the Digital Economy
<ul>
<li>The ability to quickly master hard things.</li>
<li>The ability to produce at an elite level, in terms of both quality and speed.</li>
</ul>
</li>
<li>
<p>This conclusion informs the rest of the book. If you want to be good at these two skills, the most important thing to be good at is deep work.</p>
</li>
<li>
<p>You have a finite amount of willpower that becomes depleted as you use it. The key to developing a deep work habit is to move beyond good intentions and add routines and rituals to your working life designed to minimize the amount of your limited willpower necessary to transition into and maintain a state of unbroken concentration.</p>
</li>
<li>
<p>Focus on the widely important - “The more you try to do, the less you actually accomplish”</p>
</li>
<li>
<p>Act on the lead measures - Success needs to be measured</p>
</li>
<li>
<p>Schedule deep valuable work time early each day</p>
</li>
<li>Shut down work thinking completely at the end of a work day
<ul>
<li>Downtime aids insights</li>
<li>Downtime helps recharge the energy needed to work deeply</li>
<li>The work that downtime replaces is usually not that important</li>
</ul>
</li>
<li>
<p>Deliberate practice is the systematic stretching of your ability for a given skill. It is the activity required to get better at something. Deep work and deliberate practice overlap substantially.</p>
</li>
<li>
<p>Your ritual needs to specify a location for your deep work efforts.</p>
</li>
<li>
<p>Ensure regularity in where you do your deep work</p>
</li>
<li>
<p>Avoid shared workspaces when doing deep work</p>
</li>
<li>
<p>Don’t Take Breaks from Distraction. Instead Take Breaks from Focus</p>
</li>
<li>
<p>Be comfortable being bored; idleness is essential to mental recovery</p>
</li>
<li>
<p>To succeed with deep work, we must constantly and deliberately rewire our brains to be comfortable eliminating distracting stimuli.</p>
</li>
<li>Meditate productively
<ul>
<li>Productive meditation: a period in which you are occupied physically but not mentally, like walking, jogging, driving, or running, during which you focus your attention on a single well-defined problem. This also requires practice to do well. Be wary of distraction and looping. Structure your deep thinking: reviewing and storing variables, identifying and tackling the next-step question, and then consolidating your gains improve your ability to go deep.</li>
</ul>
</li>
<li>
<p>The ability to concentrate intensely is a skill that must be trained.</p>
</li>
<li>
<p>Build a habit with a set starting time that you use every day for deep work</p>
</li>
<li>
<p>Map out when you’ll work deeply during each week at the beginning of the week, and then refine these decisions, as needed, at the beginning of each day</p>
</li>
<li>
<p>Deliberate practice: Start with an hour a day and build to three to four hours a day, five days a week, of uninterrupted and carefully directed concentration in combination with feedback so you can correct your approach</p>
</li>
<li>
<p>It’s crucial that you figure out in advance what you are going to do with your days, weeks, and weekends before they begin. Structured hobbies provide good patterns.</p>
</li>
<li>
<p>If you want to eliminate the addictive pull of entertainment sites on your time and attention, give your brain a quality alternative. Not only will this preserve your ability to resist distraction and concentrate, but you might even fulfil other ambitious goals and tasks.</p>
</li>
<li>
<p>Schedule Every Minute of Your Day</p>
</li>
<li>
<p>If you’re not sure how long a given activity might take, block off the expected time, then follow this with an additional block that has a split purpose. If you need more time for the preceding activity, use this additional block to keep working on it. If you finish the activity on time, however, have an alternate use already assigned for the extra block.</p>
</li>
<li>
<p>Be liberal with your use of task blocks. Deploy many throughout your day and make them longer than required. Loads of things not scheduled for come up during the day.</p>
</li>
<li>
<p>Quantifying the depth of every activity</p>
</li>
<li>Finish your work by 5:30pm (fixed-schedule productivity)</li>
</ul>

<p><em>Gaussian Processes from the Ground Up! (2018-01-01)</em></p>

<p>As a researcher, I am in the habit of learning new and interesting things. Most times these things turn out to be very useful one way or another. Recently, I found myself in the rabbit hole of <strong>Gaussian processes</strong>.</p>
<p>A common applied statistics task involves building regression models to characterize non-linear relationships between variables. It is possible to fit such models by assuming a particular non-linear functional form, such as a sinusoidal, exponential, or polynomial function, to describe one variable’s response to the variation in another. Unless this relationship is obvious from the outset, however, it involves possibly extensive model selection procedures to ensure the most appropriate model is retained.</p>
<p>Alternatively, a <strong>non-parametric</strong> approach can be adopted by defining a set of knots across the variable space and use a spline or kernel regression to describe arbitrary non-linear relationships. However, knot layout procedures are somewhat ad hoc and can also involve variable selection.</p>
<p>A third alternative is to adopt a <strong>Bayesian</strong> non-parametric strategy, and directly model the unknown underlying function. For this, we can employ <strong>Gaussian process</strong> models.</p>
<h2 id="gaussian-processes">Gaussian Processes</h2>
<p>A Gaussian process is essentially a handy tool for Bayesian inference on real-valued variables: a powerful model that can be used to represent a distribution over functions. While most modern machine learning techniques tend to parameterize functions and then model those parameters, Gaussian processes are non-parametric models that model the functions directly.</p>
<h2 id="bayesian-statistics">Bayesian Statistics</h2>
<p>Many people who have taken a statistics course may not have had a course in <em>Bayesian</em> statistics. Most introductory statistics courses, particularly for non-statisticians like myself, still do not cover Bayesian methods at all, except perhaps to derive Bayes’ formula as a trivial rearrangement of the definition of conditional probability. Even today, Bayesian courses are typically tacked onto the curriculum, rather than being integrated into the program.</p>
<p>In fact, Bayesian statistics is not just a particular method, or even a class of methods; it is an entirely different paradigm for doing statistical analysis.</p>
<blockquote>
<p>Practical methods for making inferences from data using probability models for quantities we observe and about which we wish to learn.
<em>– Gelman et al. 2013</em></p>
</blockquote>
<p>A Bayesian model is described by parameters; uncertainty in those parameters is described using probability distributions.</p>
<p>All conclusions from Bayesian statistical procedures are stated in terms of <em>probability statements</em>.</p>
<p><img src="/assets/images/gpintro/prob_model.png" alt="image-center" class="align-center" /></p>
<p>As a toy example, a child has a prior belief of what a sheep looks like. Later, on a trip to the countryside, the parents point at a sheep and say, “Look, there is a sheep over there.” The child sees the sheep and gets a label, then updates the prior belief based on the actual sight of the sheep, combining the initial belief with the new observation to form a new belief of what a sheep looks like.</p>
<p>In Bayesian inference, the child’s initial belief about what a sheep looks like is called the <strong>prior</strong>. The evidence from seeing the actual sheep is called the <strong>likelihood</strong>. The child then combines the <strong>prior</strong> and the <strong>likelihood</strong> using <strong>Bayes’</strong> rule to obtain the <strong>posterior</strong>.</p>
<p><img src="/assets/images/gpintro/GP1.png" alt="bayesian_inference" class="align-center" /></p>
<p>The posterior distribution, which is also Gaussian, gives us more confidence in our belief; it is calculated using <strong>Bayes’ formula</strong> as shown.</p>
<p><img src="/assets/images/gpintro/bayes_formula.png" alt="bayes formula" class="align-center" /></p>
<p>The equation expresses how our belief about the value of \(\theta\), as expressed by the <strong>prior distribution</strong> \(P(\theta)\), is reallocated following the observation of the data \(y\), as expressed by the posterior distribution.</p>
<p>The denominator \(P(y)\) cannot be calculated directly, and is actually the expression in the numerator, integrated over all \(\theta\):</p>
<script type="math/tex; mode=display">P(\theta|y) = \frac{P(y|\theta)P(\theta)}{\int P(y|\theta)P(\theta)\, d\theta}</script>
<p>The intractability of this integral is one of the factors that has contributed to the under-utilization of Bayesian methods by statisticians.</p>
<p>Since we are usually unable to calculate the denominator, numerical approximations are typically applied to estimate the posterior distribution. We will get to the maths of Gaussian processes later in this post, but for now let us get the intuition behind Gaussian processes.</p>
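<p>A tiny grid approximation (a toy sketch, not part of the original analysis) shows the mechanics: discretise \(\theta\), multiply the prior by the likelihood, and normalise by a sum in place of the intractable integral.</p>

```python
import numpy as np

# Toy coin-flip example (illustrative only): infer the bias theta
# after observing 7 heads in 10 flips.
theta = np.linspace(0.001, 0.999, 999)      # grid over the parameter space
prior = np.ones_like(theta)                 # flat prior P(theta)
likelihood = theta**7 * (1 - theta)**3      # binomial likelihood P(y | theta)

# Normalising by the sum over the grid stands in for the intractable integral.
unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()

theta_map = theta[np.argmax(posterior)]     # posterior mode, ~0.7 here
print(theta_map)
```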
<p>Let us say we intend to carry out the measurement of temperature over time using a temperature sensor. The measurements will look similar to what we have in the graph below.</p>
<p><img src="/assets/images/gpintro/temp.png" alt="bayesian_inference" class="align-center" /></p>
<p>The natural thing we may want to do next is to ask what the temperature will be at a particular time in the future.
We could place a separate Gaussian over the temperature at each point in time, but rather than working with those individual Gaussians, we represent them jointly as a function.</p>
<p><img src="/assets/images/gpintro/funct.png" alt="bayesian_inference" class="align-center" /></p>
<p>We represent the <strong>prior distribution</strong> by its mean, the dashed blue line at the centre of the function. The envelope around it spans two standard deviations, which gives us some degree of confidence in our measurements.</p>
<h2 id="modeling-functions-with-gaussians">Modeling Functions with Gaussians</h2>
<p>The major idea behind Gaussian processes is that a function can be modeled using an infinite dimensional multivariate Gaussian distribution. In other words, every point in the input space is associated with a random variable and the joint distribution of these is modeled as a multivariate Gaussian.</p>
<p>Given <script type="math/tex">x = (x_1, x_2)</script> is jointly Gaussian with parameters</p>
<script type="math/tex; mode=display">% <![CDATA[
{\mu = \begin{pmatrix}{\mu_1} \\ {\mu_2} \end{pmatrix}}, \space \space \Sigma = \left({ \begin{array}{c} {\Sigma_{11}} & {\Sigma_{12}} \\ {\Sigma_{21}} & {\Sigma_{22}} \\ \end{array} }\right) %]]></script>
<p>First, the marginal distribution of any subset of elements from a multivariate normal distribution is also normal:</p>
<script type="math/tex; mode=display">% <![CDATA[
p(x_1,x_2) = \mathcal{N}\left(\left[{
\begin{array}{c}
{\mu_{1}} \\
{\mu_{2}} \\
\end{array}
}\right], \left[{
\begin{array}{c}
{\Sigma_{11}} & {\Sigma_{12}} \\
{\Sigma_{21}} & {\Sigma_{22}} \\
\end{array}
}\right]\right) %]]></script>
<script type="math/tex; mode=display">p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11})</script>
<script type="math/tex; mode=display">p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})</script>
<p>Also, conditional distributions of a subset of a multivariate normal distribution (conditional on the remaining elements) are normal too:</p>
<script type="math/tex; mode=display">p(x_1|x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})</script>
<p>where</p>
<script type="math/tex; mode=display">\begin{align}
\mu_{1 {\mid} 2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \\
\Sigma_{1 \mid 2} = \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
\end{align}</script>
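<p>These conditioning formulas translate directly into a few lines of NumPy; the parameters below are made up for illustration:</p>

```python
import numpy as np

# Made-up parameters for a 2-D Gaussian over (x1, x2).
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

def condition_on_x2(x2, mu, Sigma):
    """Mean and variance of x1 | x2, per the formulas above."""
    mu1, mu2 = mu
    s11, s12 = Sigma[0, 0], Sigma[0, 1]
    s21, s22 = Sigma[1, 0], Sigma[1, 1]
    mu_cond = mu1 + s12 / s22 * (x2 - mu2)
    var_cond = s11 - s12 / s22 * s21
    return mu_cond, var_cond

mu_c, var_c = condition_on_x2(2.0, mu, Sigma)
print(mu_c, var_c)   # 0.3, 0.82
```

Observing \(x_2 = 2\) (above its mean of 1) shifts the conditional mean of \(x_1\) upward, and the conditional variance shrinks below the marginal variance of 1, reflecting the information gained.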
<p>A Gaussian process generalizes the multivariate normal to infinite dimension. It is defined as an infinite collection of random variables, any finite subset of which has a Gaussian distribution. Thus, the marginalization property is explicit in its definition. Another way of thinking about an infinite vector is as a <em>function</em>. When we write a function that takes continuous values as inputs, we are essentially specifying an infinite vector that only returns values (indexed by the inputs) when the function is called upon to do so. By the same token, this notion of an infinite-dimensional Gaussian as a function allows us to work with them computationally: we are never required to store all the elements of the Gaussian process, only to calculate them on demand.</p>
<p>So, we can describe a Gaussian process as a <strong>distribution over functions</strong>. Just as a multivariate normal distribution is completely specified by a mean vector and covariance matrix, a GP is fully specified by a <strong>mean function</strong> and a <strong>covariance function</strong>:</p>
<script type="math/tex; mode=display">f(x) \sim \mathcal{GP}(m(x), k(x,x^{\prime}))</script>
<p>It is the marginalization property that makes working with a Gaussian process feasible: we can marginalize over the infinitely-many variables that we are not interested in, or have not observed.</p>
<p>For example, one specification of a GP might be as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
m(x) &=0 \\
k(x,x^{\prime}) &= \theta_1\exp\left(-\frac{\theta_2}{2}(x-x^{\prime})^2\right)
\end{aligned} %]]></script>
<p>Here, the covariance function is a <strong>squared exponential</strong>, for which values of <script type="math/tex">x</script> and <script type="math/tex">x^{\prime}</script> that are close together result in values of <script type="math/tex">k</script> closer to 1, while those that are far apart return values closer to zero.</p>
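<p>The later snippets in this post call an <code>exponential_cov</code> function without showing it; a minimal implementation consistent with the squared-exponential definition above might look like this:</p>

```python
import numpy as np

def exponential_cov(x, y, params):
    """Squared-exponential kernel: theta_1 * exp(-theta_2/2 * (x - x')^2)."""
    theta1, theta2 = params
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return theta1 * np.exp(-0.5 * theta2 * np.subtract.outer(x, y) ** 2)

# Covariance between two nearby points with theta_1 = 1, theta_2 = 2:
K = exponential_cov([0.0, 0.6], [0.0, 0.6], [1, 2])
print(K)   # diagonal is 1; off-diagonal is exp(-0.36), about 0.70
```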
<p><img src="/assets/images/gpintro/output_5_0.png" alt="png" class="align-center" /></p>
<p>It may seem odd to simply adopt the zero function to represent the mean function of the Gaussian process – surely we can do better than that! It turns out that most of the learning in the GP involves the covariance function and its parameters, so very little is gained in specifying a complicated mean function.</p>
<p>For a finite number of points, the GP becomes a multivariate normal, with the mean and covariance given by the mean function and covariance function evaluated at those points.</p>
<p>For example, consider just two points from a squared exponential covariance function with parameters <script type="math/tex">\theta_1=1, \theta_2=2</script>, sampled at locations <script type="math/tex">x_1=0</script> and <script type="math/tex">x_2=0.6</script>.</p>
<p>Let’s consider a value of -1 sampled from <script type="math/tex">x_1</script>. According to our model, there is a dependence regarding where <script type="math/tex">x_2</script> will be located, governed by the covariance of the two variables.</p>
<p><img src="/assets/images/gpintro/output_13_0.png" alt="png" class="align-center" /></p>
<p>We can apply the normal distribution of <script type="math/tex">x_2 \mid x_1</script> from above to see how <script type="math/tex">x_2</script> is constrained:</p>
<script type="math/tex; mode=display">p(x_2|x_1) = \mathcal{N}(\mu_{x_2} + \Sigma_{x_2 x_1}\Sigma_{x_1}^{-1}(x_1-\mu_{x_1}),
\Sigma_{x_2}-\Sigma_{x_2 x_1}\Sigma_{x_1}^{-1}\Sigma_{x_2 x_1}^T)</script>
<p><img src="/assets/images/gpintro/output_15_0.png" alt="png" class="align-center" /></p>
<p>Notice that if we change the covariance function (either the form or the parameterization), we will change the dependence among points separated by a given distance. We will look at alternate forms of the covariance function a little later on.</p>
<h2 id="sampling-from-a-gaussian-process-prior">Sampling from a Gaussian Process Prior</h2>
<p>To make this notion of a “distribution over functions” more concrete, let’s quickly demonstrate how we obtain realizations from a Gaussian process, which result in an evaluation of a function over a set of points. All we will do here is sample from the <em>prior</em> Gaussian process, so before any data have been introduced. What we need first is our covariance function, which will be the squared exponential, and a function to evaluate the covariance at given points (resulting in a covariance matrix).</p>
<p>We are going to generate realizations sequentially, point by point, using the lovely conditioning property of multivariate Gaussian distributions. Here is that conditional:</p>
<script type="math/tex; mode=display">p(x|y) = \mathcal{N}(\mu_x + \Sigma_{xy}\Sigma_y^{-1}(y-\mu_y),
\Sigma_x-\Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T)</script>
<p>And this is the function that implements it:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">conditional</span><span class="p">(</span><span class="n">x_new</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">params</span><span class="p">):</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">exponential_cov</span><span class="p">(</span><span class="n">x_new</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
<span class="n">C</span> <span class="o">=</span> <span class="n">exponential_cov</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">exponential_cov</span><span class="p">(</span><span class="n">x_new</span><span class="p">,</span> <span class="n">x_new</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">C</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">T</span><span class="p">).</span><span class="n">T</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">A</span> <span class="o">-</span> <span class="n">B</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">C</span><span class="p">).</span><span class="n">dot</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">T</span><span class="p">))</span>
<span class="k">return</span><span class="p">(</span><span class="n">mu</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(),</span> <span class="n">sigma</span><span class="p">.</span><span class="n">squeeze</span><span class="p">())</span>
</code></pre></div></div>
<p>We will start with a Gaussian process prior with hyperparameters <script type="math/tex">\theta_0=1, \theta_1=10</script>. We will also assume a zero function as the mean, so we can plot a band that represents one standard deviation from the mean.</p>
<p><img src="/assets/images/gpintro/output_21_0.png" alt="png" class="align-center" /></p>
<p>Let’s select an arbitrary starting point to sample, say <script type="math/tex">x=1</script>. Since there are no previous points, we can sample from an unconditional Gaussian:</p>
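<p>A sketch of that first unconditional draw, followed by the conditional update at a new point (assuming a squared-exponential kernel; the helper below is a standalone stand-in for the post’s <code>exponential_cov</code>):</p>

```python
import numpy as np

def exponential_cov(x, y, params):
    # Stand-in squared-exponential kernel (see the definition earlier).
    theta1, theta2 = params
    return theta1 * np.exp(-0.5 * theta2 * np.subtract.outer(
        np.asarray(x, dtype=float), np.asarray(y, dtype=float)) ** 2)

params = [1, 10]                       # the hyperparameters from the text
rng = np.random.default_rng(0)

# First point: no data yet, so draw from the unconditional N(0, theta_0).
x = [1.0]
y = [rng.normal(scale=params[0])]

# New point: condition on (x, y) for the predictive mean and variance.
x_new = 0.7
B = exponential_cov([x_new], x, params)        # cross-covariance
C = exponential_cov(x, x, params)              # covariance of observed points
A = exponential_cov([x_new], [x_new], params)  # prior variance at x_new

mu = (B @ np.linalg.inv(C) @ np.array(y)).item()
sigma2 = (A - B @ np.linalg.inv(C) @ B.T).item()
print(mu, sigma2)
```

Note how the predictive variance at the new point is below the prior variance of 1: conditioning on the nearby observation reduces uncertainty.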
<p>We can now update our confidence band, given the point that we just sampled, using the covariance function to generate new point-wise intervals, conditional on the value <script type="math/tex">[x_0, y_0]</script>.</p>
<p><img src="/assets/images/gpintro/output_28_1.png" alt="png" class="align-center" /></p>
<p>From here we can see that the <strong>prior</strong> has been refined into a <strong>posterior distribution</strong> by conditioning on the sampled point, just like in the case with the child. The function is no longer flat but instead passes close to the data point, and around the point we sampled we now have increased confidence.</p>
<p>As we go through this process and get more and more data, we are tracing out a smooth regression function and becoming more confident about the envelope around it. Intuitively, that is a <strong>Gaussian Process</strong>.</p>
<p><img src="/assets/images/gpintro/output_34_0.png" alt="png" class="align-center" /></p>
<p>Of course, sampling sequentially is just a heuristic to demonstrate how the covariance structure works. We can just as easily sample several points at once:</p>
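<p>Sampling several points at once just means building the full covariance matrix over all the inputs and drawing one multivariate normal vector (a sketch under the same assumed squared-exponential kernel):</p>

```python
import numpy as np

# Assumed squared-exponential kernel with theta = (1, 10)
def kernel(x, y, theta=(1, 10)):
    return theta[0] * np.exp(-0.5 * theta[1] * np.subtract.outer(x, y) ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 50)
K = kernel(xs, xs)                                  # 50 x 50 covariance matrix
ys = rng.multivariate_normal(np.zeros(len(xs)), K)  # one joint draw
```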
<p><img src="/assets/images/gpintro/output_37_0.png" alt="png" class="align-center" /></p>
<p>So as the density of points becomes high, the result will be one realization (function) from the prior GP.</p>
<p>This example, of course, is trivial because it is simply a random function drawn from the prior. What we are really interested in is <em>learning</em> about an underlying function from information residing in our data. In a parametric setting, we either specify a likelihood, which we then maximize with respect to the parameters, or a full probability model, for which we calculate the posterior in a Bayesian context. Though the integrals associated with posterior distributions are typically intractable for parametric models, they do not pose a problem for Gaussian processes.</p>
<p><img src="/assets/images/gpintro/output_42_0.png" alt="png" class="align-center" /></p>
<p>Here is a sample of 10 realizations, predicted over a denser set of x-values:</p>
<p><img src="/assets/images/gpintro/output_44_1.png" alt="png" class="align-center" /></p>
<p>While univariate Gaussians are distributions over single real-valued variables, and multivariate Gaussians are distributions over finite collections of real-valued variables, Gaussian processes are distributions over functions, that is, over an infinite number of real-valued variables. This in general leads us to a notion called <strong>regression</strong>.</p>
<p>Regression is quite good for denoising and smoothing, as it does not follow every bit of noise in the data. It is also good at prediction and forecasting; for example, we might want to know what the temperature will be at a particular time in the future. However, regression carries what is called the danger of parametric models: trying to fit, say, a quadratic model to data may lead the model to miss some important features. There are also the dangers of overfitting and underfitting. These are the areas where <strong>Gaussian processes</strong> really shine, because they handle these issues very well.</p>
<p>A Gaussian process is fully specified by its mean and covariance function. Once these are chosen, they are manipulated using probability rules to obtain inference and prediction.</p>
<p>The mean and covariance functions will be the focus of my next post.</p>
<p>The Python code used to produce these plots can be found on my <a href="https://github.com/udohsolomon/GPIntroduction">Github</a>.</p>
<p>I would like to hear your thoughts on this and would appreciate any feedback.</p>
<hr />
<h2 id="references">References</h2>
<ul>
<li><a href="http://www.amazon.com/books/dp/026218253X">Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning series). The MIT Press.</a></li>
<li><a href="http://www.stat.columbia.edu/~gelman/book/">Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. CRC Press</a></li>
</ul>Solomon AmosAs a researcher, I have the habit of learning new and interesting things. Most times these things turn out to become very useful one way or the other. Recently, I found myself in this rabbit hole of Gaussian processes.Defended!2017-11-28T00:00:00+00:002017-11-28T00:00:00+00:00https://udohsolomon.github.io/personal/phd-defence<p>Yay! I made it. It’s done. I have successfully defended my PhD thesis with minor corrections, supervised by Dr Prabhu Radhakrishna and examined by Emeritus Professor John Watson from the University of Aberdeen UK.</p>
<p>Looking back, I must say that the PhD experience has been an amazing one, with both ups and downs. I am glad it was a happy ending.</p>
<p>Moving on from academia, I am ready for a new challenge. When I talk to people, they often ask, “What’s next?” I tell them that I have ruled out a “traditional post-doc” and “the tenure track route”, and that I’m not married to the academic ivory tower; I’m married to my wife. So if the timing doesn’t work out for an independent research fellowship position, I’m jumping out. Well, as things turned out, yes, I’m jumping out, and I’m looking forward to this new journey!</p>
<p>I’m very passionate about machine learning, data science and artificial intelligence. Scientists are artists in some senses, computational scientists particularly, and I think I’m ready for a new challenge.</p>Solomon AmosYay! I made it. It’s done. I have successfully defended my PhD thesis with minor corrections, supervised by Dr Prabhu Radhakrishna and examined by Emeritus Professor John Watson from the University of Aberdeen UK.Web Scraping and Automated Job Search in Python2017-11-01T00:00:00+00:002017-11-01T00:00:00+00:00https://udohsolomon.github.io/learning/web-scraping-and-automated-job-search-using-python<p>In my previous post, <a href="https://udohsolomon.github.io/personal/a-reflection-on-my-phd-experience/">A reflection on my PhD experience</a>, I stated that while waiting for my viva, I’m getting ready to start another phase of my career. I’m looking for a role as a machine learning researcher or a data scientist.</p>
<p>As someone who is very passionate about machine learning and data science and wants to make a difference, I would like to work for a company that gives me the opportunity to work with an incredible group of smart, motivated people in tackling difficult problems. However, getting such a role is not easy and does not come cheap. It requires seeking out and applying for the specific roles that best fit my skillset.</p>
<p>In this post, instead of the conventional job search, I will describe how I’m going to build an automated web scraper that collects and parses job posting information and then emails me the results every day, with links to the postings that match my skillset.</p>
<p>For this fun project, I’ll be focusing on <a href="https://www.indeed.com">Indeed</a>, a major job aggregator site that updates multiple times daily with new job postings and is one of the top stops in many people’s job search.</p>
<p>Like I said, instead of going through the conventional job search route of reading every single listing and trying to figure out whether a particular job post is the best fit for my skillset, I’m going to automate the process by creating a web scraping pipeline that will:</p>
<ol>
<li>Explore the Indeed site for recent job listings relevant to me</li>
<li>Evaluate each one of them and identify the ones that are relevant to my skillset</li>
<li>Email me the results every day, with links</li>
</ol>
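<p>The third step, the daily email, can be sketched with the standard library’s <code>smtplib</code> and <code>email</code> modules (the SMTP host, port, credentials, and addresses below are placeholders I made up, not a working configuration):</p>

```python
import smtplib
from email.message import EmailMessage

# Hypothetical helper for step 3: email the day's matching job links.
# Host, port, credentials, and addresses are placeholders.
def email_jobs(jobs_string, sender, password, recipient,
               host='smtp.example.com', port=587):
    msg = EmailMessage()
    msg['Subject'] = 'New job matches from Indeed'
    msg['From'] = sender
    msg['To'] = recipient
    msg.set_content(jobs_string)
    with smtplib.SMTP(host, port) as server:
        server.starttls()  # upgrade the connection to TLS before logging in
        server.login(sender, password)
        server.send_message(msg)
```

In practice a helper like this could be wired to a daily cron job, with the joined job string from the scraper as its input.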
<p>I will be building this tool using a Python package called <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a>, with different standalone functions implementing each step of the pipeline listed above.</p>
<h1 id="exploring-a-job-search-in-indeed">Exploring a job search in Indeed</h1>
<p>First, I would like to know whether a particular job listing is relevant to my skillset. I can achieve this by going through the job description to see if I am a good match: I would like to see required skills such as Python, data science, machine learning, and research. With this simple step, I can write a program that explores and evaluates that for me, telling me the number of times these keywords appear in the description.</p>
<p>Let’s start by pulling a single page, and working out the program to extract each piece of information we want:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Import the relevant packages and libraries
</span><span class="kn">import</span> <span class="nn">re</span> <span class="c1">#re stands for regular expressions
</span><span class="kn">import</span> <span class="nn">bs4</span> <span class="c1">#bs4 stands for BeautifulSoup
</span><span class="kn">import</span> <span class="nn">time</span> <span class="c1">#This is very relavant so that we don't overwhemlme the server
</span><span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">smtplib</span>
<span class="c1">#We define the function explore_job that does the extraction and throws an exception when there is an error
</span><span class="k">def</span> <span class="nf">explore_job</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">job_html</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">request</span><span class="p">(</span><span class="s">'GET'</span><span class="p">,</span> <span class="n">url</span><span class="p">,</span> <span class="n">timeout</span> <span class="o">=</span> <span class="mi">10</span><span class="p">)</span>
<span class="k">except</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">0</span>
<span class="c1">#specifying a desired format of “page” using the lxml parser
</span> <span class="c1">#this allows python to read the various components of the page, rather than treating it as one long string.
</span> <span class="n">job_soup</span> <span class="o">=</span> <span class="n">bs4</span><span class="p">.</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">job_html</span><span class="p">.</span><span class="n">content</span><span class="p">,</span> <span class="s">'lxml'</span><span class="p">)</span>
<span class="n">soup_body</span> <span class="o">=</span> <span class="n">job_soup</span><span class="p">(</span><span class="s">'body'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1">#Counting the number of times these keywords appear in the description
</span> <span class="n">python_count</span> <span class="o">=</span> <span class="n">soup_body</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'Python'</span><span class="p">)</span> <span class="o">+</span> <span class="n">soup_body</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'python'</span><span class="p">)</span>
<span class="n">ds_count</span> <span class="o">=</span> <span class="n">soup_body</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'Data Science'</span><span class="p">)</span> <span class="o">+</span> <span class="n">soup_body</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'data science'</span><span class="p">)</span>
<span class="n">ml_count</span> <span class="o">=</span> <span class="n">soup_body</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'Machine Learning'</span><span class="p">)</span> <span class="o">+</span> <span class="n">soup_body</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'machine learning'</span><span class="p">)</span>
<span class="n">research_count</span> <span class="o">=</span> <span class="n">soup_body</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'Research'</span><span class="p">)</span> <span class="o">+</span> <span class="n">soup_body</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">'research'</span><span class="p">)</span>
<span class="n">skill_count</span> <span class="o">=</span> <span class="n">python_count</span> <span class="o">+</span> <span class="n">ds_count</span> <span class="o">+</span> <span class="n">ml_count</span> <span class="o">+</span> <span class="n">research_count</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'ML count: {0}, Python count: {1}, DS count: {2}, Research count: {3}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">ml_count</span><span class="p">,</span> <span class="n">python_count</span><span class="p">,</span> <span class="n">ds_count</span><span class="p">,</span> <span class="n">research_count</span><span class="p">))</span>
<span class="k">return</span> <span class="n">skill_count</span>
</code></pre></div></div>
<p>Let us explore and evaluate a sample job post from Indeed: <a href="https://www.indeed.co.uk/jobs?q=data+scientist&l=United+Kingdom">a Data Scientist role in the UK</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">explore_job</span><span class="p">(</span><span class="s">'https://www.indeed.co.uk/jobs?q=data+scientist&l=United+Kingdom'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ML count: 7, Python count: 0, DS count: 5, Research count: 4
16
</code></pre></div></div>
<p>BOOM! We got some keywords that match our skillset. This shows that our explore function is working perfectly.</p>
<p>Now that we have got our explore function working, we would also like to extract other relevant information from a job posting, like the title, company name, and the date posted.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Here we define a new function that extract the relevant information from the link
</span><span class="k">def</span> <span class="nf">extract_job_info</span><span class="p">(</span><span class="n">base_url</span><span class="p">):</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">base_url</span><span class="p">)</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">bs4</span><span class="p">.</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">,</span> <span class="s">'lxml'</span><span class="p">)</span>
<span class="c1">#Extracting specific attribute from the job listing
</span> <span class="n">tags</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'div'</span><span class="p">,</span> <span class="p">{</span><span class="s">'data-tn-component'</span> <span class="p">:</span> <span class="s">"organicJob"</span><span class="p">})</span>
<span class="n">companies_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">.</span><span class="n">span</span><span class="p">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tags</span><span class="p">]</span>
<span class="n">attrs_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">.</span><span class="n">h2</span><span class="p">.</span><span class="n">a</span><span class="p">.</span><span class="n">attrs</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tags</span><span class="p">]</span>
<span class="n">dates</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'span'</span><span class="p">,</span> <span class="p">{</span><span class="s">'class'</span><span class="p">:</span><span class="s">'date'</span><span class="p">})</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tags</span><span class="p">]</span>
<span class="c1"># update attributes dictionaries with company name and date posted
</span> <span class="p">[</span><span class="n">attrs_list</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">update</span><span class="p">({</span><span class="s">'company'</span><span class="p">:</span> <span class="n">companies_list</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">strip</span><span class="p">()})</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">attrs_list</span><span class="p">)]</span>
<span class="p">[</span><span class="n">attrs_list</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">update</span><span class="p">({</span><span class="s">'date posted'</span><span class="p">:</span> <span class="n">dates</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span><span class="p">.</span><span class="n">strip</span><span class="p">()})</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">attrs_list</span><span class="p">)]</span>
<span class="k">return</span> <span class="n">attrs_list</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">extract_job_info</span><span class="p">(</span><span class="s">'https://www.indeed.co.uk/jobs?q=machine+learning&l=United+Kingdom&sort=date'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[{'class': ['turnstileLink'],
'company': 'Amazon',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=0e368de1c08b5a4e&fccid=fe2d21eef233e94a',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[0],true,0);',
'onmousedown': 'return rclk(this,jobmap[0],0);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Data Scientist Analytics'},
{'class': ['turnstileLink'],
'company': 'Dyson',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=b33762682332d318&fccid=366382f52796fce2',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[1],true,0);',
'onmousedown': 'return rclk(this,jobmap[1],0);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Graduate Data Scientist 2018'},
{'class': ['turnstileLink'],
'company': 'Amazon',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=37e32588a52e6ffe&fccid=fe2d21eef233e94a',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[2],true,0);',
'onmousedown': 'return rclk(this,jobmap[2],0);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Software Development Engineer'},
{'class': ['turnstileLink'],
'company': 'Amazon',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=f8ebe3d424b3b0b4&fccid=fe2d21eef233e94a',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[3],true,0);',
'onmousedown': 'return rclk(this,jobmap[3],0);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Knowledge Engineer (Alexa)'},
{'class': ['turnstileLink'],
'company': 'Costello & Reyes Limited',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=79d80b892dd86ff2&fccid=63ff64eb3db45a69',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[4],true,1);',
'onmousedown': 'return rclk(this,jobmap[4],1);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Data Scientist'},
{'class': ['turnstileLink'],
'company': 'Oliver Bernard',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=b9164bda760cf8b8&fccid=370f24fdcffdfafd',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[5],true,1);',
'onmousedown': 'return rclk(this,jobmap[5],1);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Data Scientist'},
{'class': ['turnstileLink'],
'company': 'Time etc Limited',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/company/Time-etc-Limited/jobs/Onboarding-Assistant-99ab7961474dd43a?fccid=937f18048530601d',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[6],true,1);',
'onmousedown': 'return rclk(this,jobmap[6],1);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Onboarding Assistant'},
{'class': ['turnstileLink'],
'company': 'University of Edinburgh',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=eeffd16ed4cfb697&fccid=16c071074ab13fe5',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[7],true,1);',
'onmousedown': 'return rclk(this,jobmap[7],1);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Research Assistant in Surgical Informatics'},
{'class': ['turnstileLink'],
'company': 'InterQuest Group',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=b2bab32b0616c828&fccid=fc28cf1816ce0889',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[8],true,0);',
'onmousedown': 'return rclk(this,jobmap[8],0);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Data Scientist'},
{'class': ['turnstileLink'],
'company': 'Aimia',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=92b52618e0e8767d&fccid=0020889bfe35c4aa',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[9],true,0);',
'onmousedown': 'return rclk(this,jobmap[9],0);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Senior Analyst'}]
</code></pre></div></div>
<p>Good job! You can see we got some pretty good information in there: the title, company name, link, and the date posted. We are on the right track, so let’s move on.</p>
<p>Apart from looking at job titles alone, I might also be interested in a position at some specific companies that I love, admire and dream of working for at some point in my career. They are called the BIG FIVE: Amazon, Apple, Microsoft, Google and Facebook.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dream_companies</span> <span class="o">=</span> <span class="p">[</span><span class="s">'amazon'</span><span class="p">,</span> <span class="s">'apple'</span><span class="p">,</span> <span class="s">'microsoft'</span><span class="p">,</span> <span class="s">'google'</span><span class="p">,</span> <span class="s">'facebook'</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#This function loops through the indeed side and looks for the recently posted ML jobs
</span><span class="k">def</span> <span class="nf">get_new_ml_jobs</span><span class="p">(</span><span class="n">days_ago_limit</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">starting_page</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">pages_limit</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span> <span class="n">old_jobs_limit</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span>
<span class="n">location</span> <span class="o">=</span> <span class="s">'United Kingdom'</span><span class="p">,</span> <span class="n">query</span> <span class="o">=</span> <span class="s">'machine learning'</span><span class="p">):</span>
<span class="n">query_formatted</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="s">' '</span><span class="p">,</span> <span class="s">'+'</span><span class="p">,</span> <span class="n">query</span><span class="p">)</span>
<span class="n">location_formatted</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="s">' '</span><span class="p">,</span> <span class="s">'+'</span><span class="p">,</span> <span class="n">location</span><span class="p">)</span>
<span class="n">indeed_url</span> <span class="o">=</span> <span class="s">'http://www.indeed.co.uk/jobs?q={0}&l={1}&sort=date&start='</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">query_formatted</span><span class="p">,</span> <span class="n">location_formatted</span><span class="p">)</span>
<span class="n">old_jobs_counter</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">new_jobs_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">starting_page</span><span class="p">,</span> <span class="n">starting_page</span> <span class="o">+</span> <span class="n">pages_limit</span><span class="p">):</span>
<span class="k">if</span> <span class="n">old_jobs_counter</span> <span class="o">>=</span> <span class="n">old_jobs_limit</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'URL: {0}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">indeed_url</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="mi">10</span><span class="p">)),</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="c1"># extract job data from Indeed page
</span> <span class="n">attrs_list</span> <span class="o">=</span> <span class="n">extract_job_info</span><span class="p">(</span><span class="n">indeed_url</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="mi">10</span><span class="p">))</span>
<span class="c1"># loop through each job, breaking out if we're past the old jobs limit
</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">attrs_list</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
<span class="k">if</span> <span class="n">old_jobs_counter</span> <span class="o">>=</span> <span class="n">old_jobs_limit</span><span class="p">:</span>
<span class="k">break</span>
<span class="n">href</span> <span class="o">=</span> <span class="n">attrs_list</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'href'</span><span class="p">]</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">attrs_list</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'title'</span><span class="p">]</span>
<span class="n">company</span> <span class="o">=</span> <span class="n">attrs_list</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'company'</span><span class="p">]</span>
<span class="n">date_posted</span> <span class="o">=</span> <span class="n">attrs_list</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'date posted'</span><span class="p">]</span>
<span class="c1"># if posting date is beyond the limit, add to the counter and skip
</span> <span class="k">try</span><span class="p">:</span>
<span class="k">if</span> <span class="nb">int</span><span class="p">(</span><span class="n">date_posted</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">>=</span> <span class="n">days_ago_limit</span><span class="p">:</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'Adding to old_jobs_counter.'</span><span class="p">)</span>
<span class="n">old_jobs_counter</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">continue</span>
<span class="k">except</span><span class="p">:</span>
<span class="k">pass</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'{0}, {1}, {2}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">company</span><span class="p">),</span> <span class="nb">repr</span><span class="p">(</span><span class="n">title</span><span class="p">),</span> <span class="nb">repr</span><span class="p">(</span><span class="n">date_posted</span><span class="p">)))</span>
<span class="c1"># Explore and evaluate the job
</span> <span class="n">exploration</span> <span class="o">=</span> <span class="n">explore_job</span><span class="p">(</span><span class="s">'http://indeed.co.uk'</span> <span class="o">+</span> <span class="n">href</span><span class="p">)</span>
<span class="k">if</span> <span class="n">exploration</span> <span class="o">>=</span> <span class="mi">1</span> <span class="ow">or</span> <span class="n">company</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="n">dream_companies</span><span class="p">:</span>
<span class="n">new_jobs_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">'{0}, {1}, {2}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">company</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="s">'http://indeed.co.uk'</span> <span class="o">+</span> <span class="n">href</span><span class="p">))</span>
<span class="c1">#This is vital because it allows the page to completly before the extraction and listing
</span> <span class="k">print</span> <span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span>
<span class="n">new_jobs_string</span> <span class="o">=</span> <span class="s">'</span><span class="se">\n\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">new_jobs_list</span><span class="p">)</span>
<span class="k">return</span> <span class="n">new_jobs_string</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_new_ds_jobs</span><span class="p">(</span><span class="n">days_ago_limit</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">starting_page</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">pages_limit</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span> <span class="n">old_jobs_limit</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span>
<span class="n">location</span> <span class="o">=</span> <span class="s">'United Kingdom'</span><span class="p">,</span> <span class="n">query</span> <span class="o">=</span> <span class="s">'data scientist'</span><span class="p">):</span>
<span class="n">query_formatted</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="s">' '</span><span class="p">,</span> <span class="s">'+'</span><span class="p">,</span> <span class="n">query</span><span class="p">)</span>
<span class="n">location_formatted</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="s">' '</span><span class="p">,</span> <span class="s">'+'</span><span class="p">,</span> <span class="n">location</span><span class="p">)</span>
<span class="n">indeed_url</span> <span class="o">=</span> <span class="s">'http://www.indeed.co.uk/jobs?q={0}&l={1}&sort=date&start='</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">query_formatted</span><span class="p">,</span> <span class="n">location_formatted</span><span class="p">)</span>
<span class="n">old_jobs_counter</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">new_jobs_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">starting_page</span><span class="p">,</span> <span class="n">starting_page</span> <span class="o">+</span> <span class="n">pages_limit</span><span class="p">):</span>
<span class="k">if</span> <span class="n">old_jobs_counter</span> <span class="o">>=</span> <span class="n">old_jobs_limit</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'URL: {0}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">indeed_url</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="mi">10</span><span class="p">)),</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="c1"># extract job data from Indeed page
</span> <span class="n">attrs_list</span> <span class="o">=</span> <span class="n">extract_job_info</span><span class="p">(</span><span class="n">indeed_url</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="mi">10</span><span class="p">))</span>
<span class="c1"># loop through each job, breaking out if we're past the old jobs limit
</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">attrs_list</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
<span class="k">if</span> <span class="n">old_jobs_counter</span> <span class="o">>=</span> <span class="n">old_jobs_limit</span><span class="p">:</span>
<span class="k">break</span>
<span class="n">href</span> <span class="o">=</span> <span class="n">attrs_list</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'href'</span><span class="p">]</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">attrs_list</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'title'</span><span class="p">]</span>
<span class="n">company</span> <span class="o">=</span> <span class="n">attrs_list</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'company'</span><span class="p">]</span>
<span class="n">date_posted</span> <span class="o">=</span> <span class="n">attrs_list</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'date posted'</span><span class="p">]</span>
<span class="c1"># if posting date is beyond the limit, add to the counter and skip
</span> <span class="k">try</span><span class="p">:</span>
<span class="k">if</span> <span class="nb">int</span><span class="p">(</span><span class="n">date_posted</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">>=</span> <span class="n">days_ago_limit</span><span class="p">:</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'Adding to old_jobs_counter.'</span><span class="p">)</span>
<span class="n">old_jobs_counter</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">continue</span>
<span class="k">except</span><span class="p">:</span>
<span class="k">pass</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'{0}, {1}, {2}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">company</span><span class="p">),</span> <span class="nb">repr</span><span class="p">(</span><span class="n">title</span><span class="p">),</span> <span class="nb">repr</span><span class="p">(</span><span class="n">date_posted</span><span class="p">)))</span>
<span class="c1"># evaluate the job
</span> <span class="n">exploration</span> <span class="o">=</span> <span class="n">explore_job</span><span class="p">(</span><span class="s">'http://indeed.co.uk'</span> <span class="o">+</span> <span class="n">href</span><span class="p">)</span>
<span class="k">if</span> <span class="n">exploration</span> <span class="o">>=</span> <span class="mi">1</span> <span class="ow">or</span> <span class="n">company</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="n">dream_companies</span><span class="p">:</span>
<span class="n">new_jobs_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">'{0}, {1}, {2}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">company</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="s">'http://indeed.co.uk'</span> <span class="o">+</span> <span class="n">href</span><span class="p">))</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span>
<span class="n">new_jobs_string</span> <span class="o">=</span> <span class="s">'</span><span class="se">\n\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">new_jobs_list</span><span class="p">)</span>
<span class="k">return</span> <span class="n">new_jobs_string</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#links = get_new_ds_jobs('https://www.indeed.co.uk/jobs?q=machine+learning&l=United+Kingdom&sort=date')[:-1]
</span></code></pre></div></div>
<h1 id="sending-the-new-jobs-as-email-to-myself">Sending the new jobs as email to myself</h1>
<p>With the help of the smtplib library, sending email is easy. We will define a function that emails us the results of our new-job searches for different scenarios.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">send_gmail</span><span class="p">(</span><span class="n">from_addr</span> <span class="o">=</span> <span class="s">'your name <email address>'</span><span class="p">,</span> <span class="n">to_addr</span> <span class="o">=</span> <span class="s">'email address'</span><span class="p">,</span>
<span class="n">location</span> <span class="o">=</span> <span class="s">'United Kingdom'</span><span class="p">,</span>
<span class="n">subject</span> <span class="o">=</span> <span class="s">'Daily Data Science and Machine Learning Jobs Update Scraped from Indeed'</span><span class="p">,</span> <span class="n">text</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
<span class="n">message</span> <span class="o">=</span> <span class="s">'Subject: {0}</span><span class="se">\n\n</span><span class="s">Jobs in: {1}</span><span class="se">\n\n</span><span class="s">{2}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
<span class="c1"># login information
</span> <span class="n">username</span> <span class="o">=</span> <span class="s">'******'</span>
<span class="n">password</span> <span class="o">=</span> <span class="s">'******'</span>
<span class="c1"># send the message
</span> <span class="n">server</span> <span class="o">=</span> <span class="n">smtplib</span><span class="p">.</span><span class="n">SMTP</span><span class="p">(</span><span class="s">'smtp.gmail.com:587'</span><span class="p">)</span>
<span class="n">server</span><span class="p">.</span><span class="n">ehlo</span><span class="p">()</span>
<span class="n">server</span><span class="p">.</span><span class="n">starttls</span><span class="p">()</span>
<span class="n">server</span><span class="p">.</span><span class="n">login</span><span class="p">(</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="p">)</span>
<span class="n">server</span><span class="p">.</span><span class="n">sendmail</span><span class="p">(</span><span class="n">from_addr</span><span class="p">,</span> <span class="n">to_addr</span><span class="p">,</span> <span class="n">message</span><span class="p">)</span>
<span class="n">server</span><span class="p">.</span><span class="n">quit</span><span class="p">()</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'Email sent.'</span><span class="p">)</span>
</code></pre></div></div>
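One caveat about the function above: the <code>'******'</code> placeholders stand for real credentials hardcoded in the source file, which is risky if the script is ever shared or committed. A minimal variant, sketched below in Python 3, reads them from environment variables instead. The variable names <code>GMAIL_USERNAME</code> and <code>GMAIL_APP_PASSWORD</code> are my own choice, and note that Gmail now generally requires an app password rather than the account password.

```python
import os
import smtplib


def build_message(subject, location, text):
    # Same message format used by send_gmail above.
    return 'Subject: {0}\n\nJobs in: {1}\n\n{2}'.format(subject, location, text)


def send_gmail_env(from_addr, to_addr, location='United Kingdom',
                   subject='Daily Jobs Update', text=None):
    # Credentials come from the environment, never from the source file.
    # Set them in your shell, e.g. export GMAIL_USERNAME=... before running.
    username = os.environ['GMAIL_USERNAME']
    password = os.environ['GMAIL_APP_PASSWORD']
    message = build_message(subject, location, text)
    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    server.sendmail(from_addr, to_addr, message)
    server.quit()
```

Everything else is identical to the version above; only the source of the credentials changes.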
<h1 id="putting-all-the-pieces-together">Putting all the pieces Together</h1>
<p>We want the code to run only when it is executed as a stand-alone program, not when it is imported into another program. The if <strong>name</strong> == “<strong>main</strong>” guard lets us do just that.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'Scraping Indeed now.'</span><span class="p">)</span>
<span class="n">start_page</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">page_limit</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">location</span> <span class="o">=</span> <span class="s">'United Kingdom'</span>
<span class="n">machine_learning_jobs</span> <span class="o">=</span> <span class="n">get_new_ml_jobs</span><span class="p">(</span><span class="n">query</span> <span class="o">=</span> <span class="s">'machine learning'</span><span class="p">,</span> <span class="n">starting_page</span> <span class="o">=</span> <span class="n">start_page</span><span class="p">,</span>
<span class="n">location</span> <span class="o">=</span> <span class="n">location</span><span class="p">,</span> <span class="n">pages_limit</span> <span class="o">=</span> <span class="n">page_limit</span><span class="p">,</span> <span class="n">days_ago_limit</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">old_jobs_limit</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">send_gmail</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">machine_learning_jobs</span><span class="p">,</span> <span class="n">location</span> <span class="o">=</span> <span class="n">location</span><span class="p">)</span>
<span class="n">data_scientist_jobs</span> <span class="o">=</span> <span class="n">get_new_ds_jobs</span><span class="p">(</span><span class="n">query</span> <span class="o">=</span> <span class="s">'data scientist'</span><span class="p">,</span> <span class="n">starting_page</span> <span class="o">=</span> <span class="n">start_page</span><span class="p">,</span>
<span class="n">location</span> <span class="o">=</span> <span class="n">location</span><span class="p">,</span> <span class="n">pages_limit</span> <span class="o">=</span> <span class="n">page_limit</span><span class="p">,</span> <span class="n">days_ago_limit</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">old_jobs_limit</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">send_gmail</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">data_scientist_jobs</span><span class="p">,</span> <span class="n">location</span> <span class="o">=</span> <span class="n">location</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Scraping Indeed now.
URL: http://www.indeed.co.uk/jobs?q=machine+learning&l=United+Kingdom&sort=date&start=0
'Amazon', 'Data Scientist Analytics', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
'Dyson', 'Graduate Data Scientist 2018', 'Just posted'
ML count: 1, Python count: 0, DS count: 1, Research count: 2
'Amazon', 'Software Development Engineer', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
'Amazon', 'Knowledge Engineer (Alexa)', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
'Costello & Reyes Limited', 'Data Scientist', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
'Oliver Bernard', 'Data Scientist', 'Just posted'
ML count: 6, Python count: 2, DS count: 5, Research count: 2
'Time etc Limited', 'Onboarding Assistant', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 1
'University of Edinburgh', 'Research Assistant in Surgical Informatics', 'Just posted'
ML count: 0, Python count: 1, DS count: 5, Research count: 55
'InterQuest Group', 'Data Scientist', 'Just posted'
ML count: 4, Python count: 5, DS count: 4, Research count: 0
URL: http://www.indeed.co.uk/jobs?q=machine+learning&l=United+Kingdom&sort=date&start=10
'Caterpillar', 'Inventory Cycle Counter', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 1
'Square One Resources', 'Python Developer', 'Just posted'
ML count: 2, Python count: 14, DS count: 2, Research count: 0
'Kent Police and Essex Police', 'Contingency Planning Officer - OPC Boreham', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
'Quantcast', 'Software Engineering Intern - Summer 2018', 'Just posted'
ML count: 0, Python count: 1, DS count: 0, Research count: 0
'University of Edinburgh', 'Postdoctoral Research Associate in High Pressure Physics', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 44
'Hitachi Consulting UK Limited', 'Business Development Executive - Analytics', 'Just posted'
ML count: 4, Python count: 0, DS count: 2, Research count: 1
'Burden Bros', 'Skilled Groundworker', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 1
'NTT DATA Services', 'Enterprise Architect - Data / AI', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 1
'Burden Bros', 'Civil Working Foreman', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 1
Email sent.
URL: http://www.indeed.co.uk/jobs?q=data+scientist&l=United+Kingdom&sort=date&start=0
'Dyson', 'Graduate Data Scientist 2018', 'Just posted'
ML count: 1, Python count: 0, DS count: 1, Research count: 2
'Amazon', 'Data Scientist Analytics', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
'HM Courts and Tribunals Service', 'Senior Performance Analyst and Data Scientist', 'Just posted'
ML count: 0, Python count: 0, DS count: 5, Research count: 5
'Oliver Bernard', 'Data Scientist', 'Just posted'
ML count: 6, Python count: 2, DS count: 5, Research count: 2
'Costello & Reyes Limited', 'Data Scientist', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
'Lancaster University', 'Research Associate in Nuclear Robotics Manipulator Control', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 7
'G-Research', 'Quantitative Operations Analyst', 'Just posted'
ML count: 3, Python count: 0, DS count: 0, Research count: 19
'InterQuest Group', 'Data Scientist', 'Just posted'
ML count: 4, Python count: 5, DS count: 4, Research count: 0
'University of Edinburgh', 'Research Assistant in Surgical Informatics', 'Just posted'
ML count: 0, Python count: 1, DS count: 5, Research count: 55
URL: http://www.indeed.co.uk/jobs?q=data+scientist&l=United+Kingdom&sort=date&start=10
'Lancaster University', 'Research Associate in Nuclear Robotics Control and Navigation', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 7
'IntaPeople', 'Data Engineer / Data Scientist', 'Just posted'
ML count: 0, Python count: 2, DS count: 2, Research count: 0
'GlaxoSmithKline', 'Category Medical Affairs Principal Scientist- Skin Health', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 3
'Amazon', 'Knowledge Engineer (Alexa)', 'Just posted'
'RZ Group', 'Quantitative Analyst', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 1
'SCI', 'Research Scientist, Antibody Discovery', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
'Talent Point', 'Senior Data Analyst', 'Just posted'
ML count: 0, Python count: 4, DS count: 6, Research count: 0
'Springer Nature', 'Associate Marketing Director, Journals', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 10
"J Sainsbury's", 'Senior Software Engineer - Android', 'Just posted'
ML count: 0, Python count: 0, DS count: 0, Research count: 0
Email sent.
</code></pre></div></div>
<p>Here is a screenshot of the email the scraping tool sent to me.</p>
<figure>
<a href="/assets/images/gmail.PNG"><img src="/assets/images/gmail.PNG" /></a>
</figure>
<h1 id="conclusion">Conclusion</h1>
<p>With a cron job, we can schedule the program to email us newly posted machine learning and data science jobs every day.</p>
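For example, assuming the script is saved at a path like <code>/home/user/indeed_jobs.py</code> (the path, interpreter location, and schedule below are placeholders, not from the original post), a crontab entry such as this would run it every morning at 8:00:

```shell
# Open the crontab editor with `crontab -e` and add one line.
# Field order: minute hour day-of-month month day-of-week command
0 8 * * * /usr/bin/python3 /home/user/indeed_jobs.py
```

Cron runs the command with a minimal environment, so any credentials read from environment variables need to be defined in the crontab or in a wrapper script.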
<p>While this is particularly useful, it’s about finding jobs that I might be a good fit for – not jobs that might be a good fit for me.</p>
Solomon Amos
On my previous post, A reflection on my PhD experience, I stated that while waiting for my viva, I’m getting ready to start another phase of my career. I’m looking for a role as a machine learning researcher or a data scientist.
My Python Deliberate Practice2017-10-22T00:00:00+00:002017-10-22T00:00:00+00:00https://udohsolomon.github.io/personal/My-python-deliberate-practice<p>First of all, don’t be afraid: read <a href="http://pbpython.com/plateau-of-productivity.html">Plateau of Productivity</a>. More importantly, be patient; a good read from Peter Norvig is <a href="http://norvig.com/21-days.html">Teach Yourself Programming in 10 years</a>.</p>
<p>“Researchers have shown it takes about ten years to develop expertise in any of a wide variety of areas, including chess playing, music composition, telegraph operation, painting, piano playing, swimming, tennis, and research in neuropsychology and topology. The key is deliberative practice: not just doing it again and again, but challenging yourself with a task that is just beyond your current ability, trying it, analyzing your performance while and after doing it, and correcting any mistakes. Then repeat. And repeat again. There appear to be no real shortcuts” - <a href="http://norvig.com/21-days.html">Teach Yourself Programming in 10 years</a></p>
<h2 id="motivation">Motivation</h2>
<p>I used Matlab extensively during my PhD program for different purposes: simulations, data analysis, and visualization.</p>
<p>By now, I have a pretty good working knowledge of Matlab. There are obviously many more things that I could learn - in particular, building and maintaining Matlab modules, as well as more <a href="https://blogs.mathworks.com/loren/">advanced Matlab materials</a>. However, Python has always appealed to me for a few reasons as I focus on machine learning and data science:</p>
<ul>
<li>It’s a general purpose programming language, so presumably it is a lot easier to learn good software engineering principles. (What are they though?)</li>
<li>Many of the <a href="https://lab.getbase.com/productive-data-science-python/">data stacks</a> are built using the tools in the Python ecosystem (ETL using Airflow, Front-end using Flask with RESTful API supports, Machine Learning using scikit-learn) - being able to use the same language for different parts of the data stack will bring prototypes closer to production.</li>
</ul>
<p>To me, the appeal of Python is not just the data analysis part; it is that you have a higher chance of seeing how data plays a role within the whole integrated technology stack. Knowing Python is likely to make me a better <strong>end-to-end</strong> data scientist and a better machine learning engineer.</p>
<p>Here is a great <a href="https://www.reddit.com/r/Python/comments/2tkkxd/considering_putting_my_efforts_into_python/">reddit answer</a> that explains the intersection and disjoint union of the two languages beautifully.</p>
<h2 id="deliberate-practice">Deliberate Practice</h2>
<p>I am a huge believer in learning by doing, and there are a lot of opportunities on the job where I can hone my Python skills through Deliberate Practice:</p>
<ul>
<li>
<p><strong>Identify the Top Performers</strong>: I think there are quite a few people (e.g. Robert C.) who can really be role models for me to follow. Understand what they went through to get to where they are today. What mental representations of Python do they have that I do not?</p>
</li>
<li>
<p><strong>Build Practice Plans</strong>: Ideally, based on the rough understanding of that mental representation:</p>
<ul>
<li>Define clear goals and select learning materials</li>
<li>Create deadline and milestones for the project</li>
<li>Estimate the time required and come up with weekly schedules</li>
</ul>
<p>Augment these insights with your current level of mental representation of Python to improve your understanding.</p>
</li>
<li>
<p><strong>Targeted Practice</strong>: If I force myself to switch over to Python for Data Analysis, Data visualization, Modeling, or contribute to open source Python Data Analysis packages, I can maximize my time practicing this skill, which is high leverage.</p>
</li>
<li>
<p><strong>Immediate Feedback</strong>: The importance of feedback cannot be overemphasized. I have built the habit of constantly sending my code to friends and my connections online for review, critique, and feedback. Find as many opportunities to get feedback as you can.</p>
</li>
</ul>
<h2 id="performance-goals">Performance Goals</h2>
<ul>
<li><strong>[Immediate]</strong> Learn to write pythonic code</li>
<li><strong>[Shorter term, easiest to practice]</strong> Write re-usable, modular, tested code for my data work and knowledge posts</li>
<li><strong>[Medium term, harder to practice]</strong> Achieve efficiency and feature parity on Data Analysis using Python compared to R</li>
<li><strong>[Longer term, hardest to practice]</strong> Write tools. Being able to work on projects that span the entire data stack using Python, apply good software engineering principles to these projects</li>
</ul>
<h2 id="project-goals">Project Goals</h2>
<ul>
<li>
<p><strong>Outcome</strong>: I want to move my data stack to Python completely. This means my day-to-day data analysis work will be done in Python instead of R, make my code as pythonic as possible. Become a Contributor to Airpy / tools, and take on one bigger Python project (ML, Data Viz …etc).</p>
</li>
<li>
<p><strong>Curriculum</strong>: I want to do everything I can to work through all the basic materials in the Pandas/Matplotlib combo. Expose myself to functional programming, OOP, testing in Python, and even building command-line tools. Get feedback from experts.</p>
</li>
<li>
<p><strong>Timeframe</strong>: Efficiency parity by end of December, 2017. One ongoing big project touching different stacks in Python by the end of 2017.</p>
</li>
</ul>
<h2 id="project-milestones">Project Milestones</h2>
<ul>
<li><strong>Learning Python & Best Practices</strong>
<ul>
<li><a href="http://stackoverflow.com/questions/2573135/python-progression-path-from-apprentice-to-guru">Build On Top of the Basics: Python Progression</a></li>
<li><a href="https://www.jeffknupp.com/blog/2013/02/14/drastically-improve-your-python-understanding-pythons-execution-model/">Drastically Improve Your Understanding: Jeff Knupp: Python’s Execution Mode</a></li>
<li><a href="https://www.youtube.com/watch?time_continue=14&v=EnSu9hHGq5o">Nate Batchelder: Loop like a native</a></li>
<li><a href="http://columbia-applied-data-science.github.io/pages/lowclass-python-style-guide.html">Columbia Data Scientist Style Guide</a></li>
</ul>
</li>
<li><strong>Writing Pythonic Code</strong>
<ul>
<li>Guidelines For Writing Pythonic Code
<ul>
<li>Function: Use *args and **kwargs to accept arbitrary arguments in function definition</li>
<li>Tuples: effective unpacking, use _ for placeholder, swap values without tmp variables</li>
<li>List/Dict/Set: list comprehension, dict comprehension. dict.get, set comprehension</li>
<li>Strings: use .format, use .join</li>
<li>Classes: use double leading underscores (__) in function and variable names to mark private members</li>
<li>Generator: use a generator to lazily produce an infinite sequence</li>
<li>Modules: writing modules for encapsulation</li>
<li>Formatting: pep8 standards</li>
<li>Executable script: guard entry points with if <strong>name</strong> == “<strong>main</strong>”</li>
<li>Import: The right way to do imports</li>
</ul>
</li>
<li><a href="https://jeffknupp.com/writing-idiomatic-python-ebook/">Writing Idiomatic Python - Jeff Knupp</a></li>
<li><a href="https://drive.google.com/file/d/0B-eHIhYpHrGDNGZCYUN6SVB1OGc/view">Stanford CS 41: Idiomatic Python</a></li>
<li><a href="http://safehammad.com/downloads/python-idioms-2014-01-16.pdf">Another Tutorial On How To Write Pythonic Code</a></li>
</ul>
</li>
<li>
<p><strong>iPython Notebook</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=HrylK8I1ALs&index=3&list=PLKW2Azk23ZtSeBcvJi0JnL7PapedOvwz9">BIDS: Python Bootcamp: IPython Notebook</a></li>
<li><a href="https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/">Jupyter Notebook tips, tricks and shortcuts</a></li>
<li><a href="https://www.webucator.com/blog/wp-content/uploads/2015/07/IPython-Notebook-Shortcuts.pdf">iPython Notebook Keybinding</a></li>
</ul>
</li>
<li>
<p><strong>Pandas For Data Analysis</strong></p>
<ul>
<li>Introduction to Numpy
<ul>
<li><a href="https://www.youtube.com/watch?v=PDOsOcG0m-Q">BIDS: Python Bootcamp: Intro to Numpy</a></li>
<li><a href="http://stanford.edu/~arbenson/cme193.html">Stanford ICME 193: Scientific Python</a></li>
</ul>
</li>
<li>Introduction to Pandas
<ul>
<li><a href="http://nbviewer.jupyter.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0">Dplyr/pandas Vignette Comparison</a></li>
<li><a href="http://www.dataschool.io/easier-data-analysis-with-pandas/">Data School Pandas Tutorials</a>
<ul>
<li><a href="https://github.com/justmarkham/pandas-videos">Data School Pandas Github iPython notebook</a></li>
<li><a href="https://www.youtube.com/watch?v=CWRKgBtZN18&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=31">More Pandas Questions Answered</a></li>
<li><a href="http://www.dataschool.io/best-python-pandas-resources/">Other Resources</a></li>
</ul>
</li>
<li><a href="https://www.youtube.com/watch?v=5JnMutdy6Fw">Brandon Rhodes’ Pandas From The Ground Up</a></li>
<li><a href="https://www.youtube.com/watch?v=otCriSKVV_8">Tom Augspurgur: Pandas</a></li>
<li><a href="https://www.youtube.com/watch?v=bgIZAeNpL1U">BIDS: Python Bootcamp: Scipy Pandas</a></li>
<li><a href="https://www.coursera.org/learn/python-data-analysis/home/welcome">Coursera: Introduction to Data Science in Python</a></li>
<li><a href="http://chrisalbon.com/">Chris Albon’s notes</a></li>
</ul>
</li>
</ul>
</li>
<li>
<p><strong>Data Visualization</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=j5P822TSCKs">BIDS: Python Bootcamp: Intro to Matplotlib</a> The 800 pound gorilla, everything is customizable, but very low level</li>
<li><a href="https://stanford.edu/~mwaskom/software/seaborn/">Seaborn</a> Good for statistical visualization. I still find it a bit limited on the type of simple plots it can do</li>
<li><a href="http://bokeh.pydata.org/en/latest/">Bokeh</a> Interactive, web browser base data visualization</li>
<li><a href="https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/">A Dramatic Tour through Python’s Data Visualization Landscape (including ggplot and Altair)</a></li>
</ul>
</li>
<li>
<p><strong>Writing Object Oriented Programming Python Code</strong></p>
<ul>
<li><a href="http://tjelvarolsson.com/blog/object-oriented-programming-for-scientists/">Computational Biology: OOP For Scientist</a></li>
<li><a href="https://jeffknupp.com/blog/2014/06/18/improve-your-python-python-classes-and-object-oriented-programming/">Improve Your Python: Jeff Knupp: OOP</a></li>
<li><a href="https://www.youtube.com/watch?v=HQ0q6oMpOEs">BIDS: Python Bootcamp: OOP</a></li>
<li>Simeon Franklin’s Twitter University Class (not available to the public)</li>
</ul>
</li>
<li>
<p><strong>Writing Functional Programming Python Code</strong></p>
<ul>
<li><a href="http://simeonfranklin.com/blog/2013/jun/17/higher-order-functions-python/">Simeon Franklin’s higher order function</a></li>
<li><a href="https://www.youtube.com/watch?v=ob797BA49ZQ">BIDS: Python Bootcamp: Higher order functions</a></li>
<li><a href="https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/">Improve Your Python: Jeff Knupp: Yield & Generator Explained</a></li>
<li><a href="https://jeffknupp.com/blog/2013/11/29/improve-your-python-decorators-explained/">Improve Your Python: Jeff Knupp: Decorator Explained</a></li>
<li><a href="https://jeffknupp.com/blog/2016/03/07/improve-your-python-the-with-statement-and-context-managers/">Improve Your Python: Jeff Knupp: Context Manager</a></li>
</ul>
</li>
<li><strong>Machine Learning In Python</strong>
<ul>
<li><a href="http://www.dataschool.io/machine-learning-with-scikit-learn/">Scikit-learn Machine Learning Library</a></li>
<li><a href="http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics">Scikit-learn metrics</a></li>
<li><a href="http://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline">Scikit-learn Pipeline</a></li>
</ul>
</li>
<li>
<p><strong>Testing In Python</strong></p>
<ul>
<li><a href="http://tjelvarolsson.com/blog/four-tools-for-testing-your-python-code/">Computational Biology: Four Tools For Testing Your Python Code</a></li>
<li><a href="http://tjelvarolsson.com/blog/test-driven-develpment-for-scientists/">Computational Biology: Testing For Scientist</a></li>
<li><a href="https://jeffknupp.com/blog/2013/12/09/improve-your-python-understanding-unit-testing/">Improve Your Python: Jeff Knupp: Understanding Unit Testing</a></li>
<li><a href="https://www.youtube.com/watch?v=hrj8Wo34nvw">BIDS: Python Bootcamp: Test Driven Development</a></li>
<li><a href="http://katyhuff.github.io/python-testing/">Software Carpentry: Testing</a></li>
</ul>
</li>
</ul>
<h2 id="next-steps--level-in-2018">Next Steps / Level In 2018</h2>
<p>Once you have mastered all of the above, the next natural step is to create public work that other people can use, so that your useful tools reach a wider audience. A great introduction to getting started is Tim Hopper’s talk, titled <a href="https://www.youtube.com/watch?v=uRul8QdYvqQ">Sharing Your Side Projects</a>.</p>
<ul>
<li><strong>Logging In Python (Next Year?)</strong>
<ul>
<li><a href="https://www.youtube.com/watch?v=PX_xd2YjrsU">Basic Python Logging - Code Session</a></li>
<li><a href="https://docs.python.org/2/howto/logging.html">Logging HOWTO</a></li>
<li><a href="https://www.youtube.com/watch?v=24_4WWkSmNo">Become A Logging Expert In 30 Minutes</a></li>
</ul>
</li>
<li><strong>Writing Command-Line Tool (Next Year?)</strong>
<ul>
<li><a href="http://click.pocoo.org/5/quickstart/">Click Documentation</a></li>
<li><a href="http://nvie.com/posts/writing-a-cli-in-python-in-under-60-seconds/">Writing A Command-Line Tool In Python</a></li>
</ul>
</li>
<li><strong>Building Packages In Python (Next Year?)</strong>
<ul>
<li><a href="http://tjelvarolsson.com/blog/using-cookiecutter-a-passive-code-generator/">Computational Biology: Using Cookiecutter To Set Up A Project</a></li>
<li><a href="http://tjelvarolsson.com/blog/begginers-guide-creating-clean-python-development-environments/">Computational Biology: Creating A Clean Python Development Environment</a></li>
<li><a href="http://tjelvarolsson.com/blog/how-to-generate-beautiful-technical-documentation/">Computational Biology: How To Generate Beautiful Technical Documentation</a></li>
<li><a href="http://tjelvarolsson.com/blog/five-steps-to-add-the-bling-factor-to-your-python-package/">Computational Biology: Five Steps To Add The Bling Factor To Your Python Package</a></li>
</ul>
</li>
</ul>
<h2 id="reference">Reference</h2>
<ul>
<li><a href="http://www.pythontutor.com/visualize.html#mode=edit">Python Tutor Visualizer</a></li>
<li><a href="https://www.kevinsheppard.com/images/0/09/Python_introduction.pdf">Python For Data Analysis</a></li>
<li><a href="http://stanfordpython.com/">Stanford CS 41: Python</a></li>
<li><a href="http://cs88-website.github.io/">Berkeley CS 88: Python Data Structure</a></li>
<li><a href="http://cs109.github.io/2015/">Harvard CS 109: Data Science</a></li>
<li><a href="https://bids.berkeley.edu/news/python-boot-camp-fall-2016-training-videos-available-online">Berkeley BIDS Python bootcamp</a></li>
<li><a href="https://github.com/profjsb/python-seminar">Josh Bloom’s Python Computing For Data Science</a></li>
<li><a href="https://jeffknupp.com/writing-idiomatic-python-ebook/">Writing Idiomatic Python - Jeff Knupp</a></li>
<li><a href="http://safehammad.com/downloads/python-idioms-2014-01-16.pdf">Another Tutorial On How To Write Pythonic Code</a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/cookbook.html">Pandas Cookbook</a></li>
<li><a href="https://www.udemy.com/learning-python-for-data-analysis-and-visualization/?ccManual=&couponCode=DEAL19">Udemy course</a></li>
</ul>Solomon AmosFirst of all, don’t be afraid, read Plateau of Productivity. More importantly, be patient, a good read from Peter Norvig, titled Teach Yourself Programming in 10 years.A reflection on my PhD experience2017-10-18T00:00:00+00:002017-10-18T00:00:00+00:00https://udohsolomon.github.io/personal/a-reflection-on-my-phd-experience<p>Now that my PhD program is coming to an end, while waiting for my viva, I’m getting ready to start another phase of my career. I’m looking for a role as a machine learning researcher or a data scientist.</p>
<p>As I reflect on my PhD experience, I could write a thousand words about the journey: from my initial idea of changing the world once I finished my PhD, to my supervisor thinking I could finish the program within 2 years at the rate I started, to the point when I got into the valley of shit (a period in your PhD, however brief, when you lose perspective and therefore confidence and belief in your ability).</p>
<p>The valley of shit is a terrible place to be; it smells. No one walks with you down there, no matter how close they are to you. No matter how reassuring the encouraging words can be, somehow you still have the feeling that the valley has no end. Despite the many challenges, I never lost faith; I kept fighting, kept walking through that valley and kept pushing. By which I mean I kept writing, kept doing analyses, kept reading and reproducing papers, and kept writing code. Now I can gladly say it has been an amazing experience.</p>
<p>The PhD experience has been an amazing one. It has shaped me and equipped me not just to answer a given question, but also to define the question. Sciences are primarily defined by their questions rather than by their tools. It has also taught me to solve problems independently and to strive for excellence at all times.</p>Solomon AmosNow that my PhD program is coming to an end, while waiting for my viva, I’m getting ready to start another phase of my career. I’m looking for a role as a machine learning researcher or a data scientist. As I reflect on my PhD experience, I could write a thousand word about how my initial idea was to change the world when finishing or after my PhD to when my supervisor thought I could end up finishing my program within 2 years with the rate I started to when I got into the valley of shit (a period in your PhD however brief, when you lose perspective and therefore confidence and belief in your ability).Anomaly detection algorithm implemented in Python2017-09-12T00:00:00+00:002017-09-12T00:00:00+00:00https://udohsolomon.github.io/machine%20learning/Anomaly-detection<p>This post is an overview of a simple anomaly detection algorithm implemented in Python. While there are different types of anomaly detection algorithms, we will focus on the univariate Gaussian and the multivariate Gaussian normal distribution algorithms in this post.</p>
<p>The anomaly detection problem is the identification of data points that are outliers relative to some standard or expected outcome. It has a wide range of applications across several industries because of the critical, actionable information it provides. Examples include fraud detection in online transactions, an unexpected spike in user growth on a website over a short period, faulty components flagged by sensors in a production plant, and the monitoring of computer servers in a data centre.</p>
<p>The basic approach to anomaly detection is to define a boundary around the normal data points that separates them from the outliers. In practice, however, several factors make this approach challenging.</p>
<p>In this post, we will implement an anomaly detection algorithm to detect outliers among computer servers in a data centre for monitoring purposes. The Gaussian distribution model is used for this example. First, we will describe the univariate Gaussian distribution model; after that we will detail the multivariate Gaussian distribution; and lastly, we will carry out the implementation in Python.</p>
<h2 id="univariate-gaussian-normal-distribution-model">Univariate Gaussian normal distribution model</h2>
<p>Here we present some basic facts regarding the Gaussian normal distribution model. It is commonly expressed in terms of the parameters <script type="math/tex">x</script>, <script type="math/tex">\mu</script> and <script type="math/tex">\sigma^{2}</script>, where <script type="math/tex">x</script> is the feature matrix, <script type="math/tex">\mu</script> is the per-feature mean and <script type="math/tex">\sigma^{2}</script> is the per-feature variance.</p>
<ol>
<li>Choose features <script type="math/tex">x_i</script> that might be indicative of anomalous examples</li>
<li>
<p>Fit parameters <script type="math/tex">\mu_{1},...,{\mu}_n, {\sigma_1^{2}},...,{\sigma_n^{2}}</script></p>
<script type="math/tex; mode=display">\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{j}^{(i)}</script>
<script type="math/tex; mode=display">\sigma_j^{2} = \frac{1}{m}\sum_{i=1}^{m}{(x_j^{(i)} - \mu_j)}^{2}</script>
</li>
<li>
<p>Given new example <script type="math/tex">x</script>, compute <script type="math/tex">p(x)</script></p>
<script type="math/tex; mode=display">p(x) = \prod_{j=1}^{n}p(x_j;\mu_j,\sigma_j^{2}) = \prod_{j=1}^{n}\frac{1}{\sqrt{2{\pi}}\sigma_j}\exp(-\frac{(x_j - \mu_j)^2}{2\sigma_j^{2}})</script>
</li>
<li>Anomaly if <script type="math/tex">% <![CDATA[
p(x) < {\epsilon} %]]></script></li>
</ol>
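<p>The four steps above can be sketched directly with NumPy. Note that the training data, the threshold value and the new example below are hypothetical placeholders, not part of this post's dataset:</p>

```python
import numpy as np

# Step 2: fit per-feature mean and variance
def fit_univariate(X):
    return X.mean(axis=0), X.var(axis=0)

# Step 3: p(x) as the product of per-feature Gaussian densities
def p_univariate(x, mu, sigma2):
    densities = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=-1)

# Hypothetical training data: 100 examples, 2 features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(100, 2))

mu, sigma2 = fit_univariate(X_train)

# Step 4: flag an anomaly when p(x) falls below a chosen threshold
epsilon = 1e-4  # assumed value; in practice tuned on a labelled validation set
x_new = np.array([5.2, 9.8])
is_anomaly = p_univariate(x_new, mu, sigma2) < epsilon
```

<p>A point close to the fitted mean yields a density well above the threshold, while a point far from the cluster drives the product of densities towards zero.</p>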
<p>For our case study, monitoring computer servers in a data centre, let us go through the process of choosing our features <script type="math/tex">x_i</script>. Normally we want to choose features that might take on unusually large or small values in the event of an anomaly.
For example, some of the features we might choose would be:</p>
<p><script type="math/tex">x_1</script> = memory use of computer</p>
<p><script type="math/tex">x_2</script> = number of disk accesses/sec</p>
<p><script type="math/tex">x_3</script> = CPU load</p>
<p><script type="math/tex">x_4</script> = network traffic</p>
<p>Let us assume we suspect that one of our computers gets stuck in some infinite loop so that the CPU load grows but the network traffic does not. In this case, to detect that kind of anomaly, we may have to create a new feature <script type="math/tex">x_5</script> such that</p>
<script type="math/tex; mode=display">x_5 = {\frac{CPU\enspace load}{network\enspace traffic}}</script>
<p>This approach of manually creating features really helps in detecting anomalies or unusual combinations of values that would not otherwise have been flagged by our algorithm. However, when our algorithm fails to detect an anomaly and we don’t have the luxury of creating new features, how do we go about fixing this?</p>
<p>One possible way of fixing this sort of strange behaviour is to develop a modified version of the Gaussian normal distribution known as the multivariate Gaussian distribution.</p>
<h2 id="multivariate-gaussian-distribution-model">Multivariate Gaussian distribution model</h2>
<p>The multivariate Gaussian distribution is expressed in terms of parameters <script type="math/tex">\mu</script> and <script type="math/tex">\Sigma</script>, where <script type="math/tex">\mu</script> is an <script type="math/tex">n \times 1</script> vector and <script type="math/tex">\Sigma</script> is an <script type="math/tex">n \times n</script> covariance matrix. The multivariate Gaussian model automatically captures the correlations between features so that we don’t have to create them manually.</p>
<p>Instead of modelling our <script type="math/tex">p(x)</script> one feature at a time, as we did with the univariate Gaussian distribution, we are going to model <script type="math/tex">p(x)</script> all in one go.</p>
<p>Given parameters: <script type="math/tex">\mu \in {\Re}^n</script> and <script type="math/tex">\Sigma \in {\Re}^{n \times n}</script> (covariance matrix)</p>
<script type="math/tex; mode=display">p(x; \space \mu, \space \Sigma) = {\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}}\exp(\,-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu))\,</script>
<p>To detect anomaly with the multivariate Gaussian distribution, we go through the following;</p>
<ol>
<li>Given training set <script type="math/tex">\{x^{(1)}, x^{(2)}, x^{(3)}, . . . , x^{(m)}\}</script></li>
<li>Fit model <script type="math/tex">p(x)</script> by setting</li>
</ol>
<script type="math/tex; mode=display">\mu = \frac{1}{m}\sum_{i=1}^{m} x^{i}</script>
<script type="math/tex; mode=display">\Sigma = \frac{1}{m}\sum_{i=1}^{m}{(x^{(i)} - \mu)}{(x^{(i)} - \mu)}^{T}</script>
<ol>
<li>Given a new example <script type="math/tex">x</script>, compute</li>
</ol>
<script type="math/tex; mode=display">p(x; \space \mu, \space \Sigma) = {\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}}\exp(\,-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu))\,</script>
<ol>
<li>Flag an anomaly if <script type="math/tex">% <![CDATA[
p(x) < \epsilon %]]></script></li>
</ol>
<p>It is possible to show mathematically that the original Gaussian distribution is the same as the multivariate Gaussian but with a constraint. The constraint is that the covariance matrix <script type="math/tex">\Sigma</script> must have zeros in all the off-diagonal elements of the matrix.</p>
<p>While this approach automatically captures correlations between features, it is computationally more expensive to implement. The Gaussian normal distribution is cheaper and scales better.</p>
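<p>This equivalence under a diagonal covariance matrix is easy to check numerically with SciPy; the numbers below are purely illustrative:</p>

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0])
sigma2 = np.array([0.5, 3.0])  # per-feature variances
Sigma = np.diag(sigma2)        # diagonal covariance matrix (zero off-diagonal)

x = np.array([1.3, -1.1])

# Multivariate density with a diagonal covariance matrix ...
p_multi = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
# ... equals the product of the per-feature univariate densities
p_prod = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

assert np.isclose(p_multi, p_prod)
```

<p>With any non-zero off-diagonal entries in <code>Sigma</code>, the two quantities would no longer agree, which is exactly the extra modelling power the multivariate form buys.</p>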
<h2 id="python-implementation-of-anomaly-detection-algorithm">Python implementation of anomaly detection algorithm</h2>
<p>The task here is to use the multivariate Gaussian model to detect whether an unlabelled example from our dataset should be flagged as an anomaly. To keep things simple, we will deal with a simple 2-dimensional dataset containing only two features: the throughput <script type="math/tex">(mb/s)</script> and the latency <script type="math/tex">(ms)</script> response of each server. First, let us visualise our dataset and explore what exactly is going on.</p>
<h2 id="data-exploration">Data exploration</h2>
<p>Before we start, we need to explore our dataset; plotting our features will give us a good visual representation and better insight into what is going on. To accomplish this, we must first import all the required libraries in Python.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">genfromtxt</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">multivariate_normal</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">f1_score</span>
</code></pre></div></div>
<figure>
<a href="/assets/images/trainfeature.png"><img src="/assets/images/trainfeature.png" /></a>
</figure>
<p>As shown in the figure, the datapoints are tightly clustered at the centre, with a few points further away from the cluster. Merely by looking at the graph, we can easily tell in this simple example that the points further away from the cluster could be considered anomalies. But our goal here is to use the multivariate Gaussian model to estimate the distribution of the features in the datapoints. To achieve this, we define the functions that make up our Gaussian distribution and compute the mean and covariance of the features in our dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1">#define the function for reading our data
</span><span class="k">def</span> <span class="nf">read_dataset</span><span class="p">(</span><span class="n">filePath</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">','</span><span class="p">):</span>
<span class="k">return</span> <span class="n">genfromtxt</span><span class="p">(</span><span class="n">filePath</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="n">delimiter</span><span class="p">)</span>
<span class="c1">#define parameters for feature normalization
</span><span class="k">def</span> <span class="nf">feature_normalize</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="p">(</span><span class="n">dataset</span> <span class="o">-</span> <span class="n">mu</span><span class="p">)</span> <span class="o">/</span> <span class="n">sigma</span>
<span class="c1">#define the parameter and estimate the Gaussian distribution
</span><span class="k">def</span> <span class="nf">estimate_gaussian</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cov</span><span class="p">(</span><span class="n">dataset</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="k">return</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span>
<span class="c1">#define the multivariate Gaussian distribution
</span><span class="k">def</span> <span class="nf">multivariate_gaussian</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">):</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="n">mu</span><span class="p">,</span> <span class="n">cov</span><span class="o">=</span><span class="n">sigma</span><span class="p">)</span>
<span class="k">return</span> <span class="n">p</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="implementing-the-anomaly-detection-algorithm-using-the-gaussian-model">Implementing the anomaly detection algorithm using the Gaussian model</h2>
<p>Next, we define a function that will help us find the optimal value of the threshold <script type="math/tex">\epsilon</script>, which will be used to separate the normal from the anomalous datapoints. We are going to make use of the cross-validation dataset to learn the optimal value of <script type="math/tex">\epsilon</script>. To achieve this, we try different values across the range of learned probabilities, and then calculate the f1-score of the predicted anomalies against the available ground-truth data.
The <script type="math/tex">\epsilon</script> with the highest f1-score will be our threshold. This means that probabilities that lie below the selected threshold will be considered anomalous.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">select_threshold</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">test_data</span><span class="p">):</span>
<span class="n">best_epsilon</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">best_f1</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">f</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">stepsize</span> <span class="o">=</span> <span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">probs</span><span class="p">)</span> <span class="o">-</span> <span class="nb">min</span><span class="p">(</span><span class="n">probs</span><span class="p">))</span> <span class="o">/</span> <span class="mi">1000</span>
<span class="n">epsilons</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">probs</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">probs</span><span class="p">),</span> <span class="n">stepsize</span><span class="p">)</span>
<span class="k">for</span> <span class="n">epsilon</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">nditer</span><span class="p">(</span><span class="n">epsilons</span><span class="p">):</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="p">(</span><span class="n">probs</span> <span class="o"><</span> <span class="n">epsilon</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">f1_score</span><span class="p">(</span><span class="n">test_data</span><span class="p">,</span> <span class="n">predictions</span><span class="p">,</span> <span class="n">average</span><span class="o">=</span><span class="s">'binary'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">f</span> <span class="o">></span> <span class="n">best_f1</span><span class="p">:</span>
<span class="n">best_f1</span> <span class="o">=</span> <span class="n">f</span>
<span class="n">best_epsilon</span> <span class="o">=</span> <span class="n">epsilon</span>
<span class="k">return</span> <span class="n">best_f1</span><span class="p">,</span> <span class="n">best_epsilon</span>
<span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">estimate_gaussian</span><span class="p">(</span><span class="n">train_data</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">multivariate_gaussian</span><span class="p">(</span><span class="n">train_data</span><span class="p">,</span><span class="n">mu</span><span class="p">,</span><span class="n">sigma</span><span class="p">)</span>
<span class="c1">#selecting optimal value of epsilon using cross validation
</span><span class="n">p_cv</span> <span class="o">=</span> <span class="n">multivariate_gaussian</span><span class="p">(</span><span class="n">crossval_data</span><span class="p">,</span><span class="n">mu</span><span class="p">,</span><span class="n">sigma</span><span class="p">)</span>
<span class="n">fscore</span><span class="p">,</span> <span class="n">ep</span> <span class="o">=</span> <span class="n">select_threshold</span><span class="p">(</span><span class="n">p_cv</span><span class="p">,</span><span class="n">test_data</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">fscore</span><span class="p">,</span> <span class="n">ep</span><span class="p">)</span>
<span class="c1">#selecting outlier datapoints
</span><span class="n">outliers</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">p</span> <span class="o"><</span> <span class="n">ep</span><span class="p">))</span>
</code></pre></div></div>
<figure>
<a href="/assets/images/outlier.png"><img src="/assets/images/outlier.png" /></a>
</figure>
<p>Looks pretty cool, right? The points in red are the ones that were flagged as outliers, which is consistent with what we observed in the first graph when exploring the datapoints. This really makes sense. We have been able to implement a simple anomaly detection algorithm using the Gaussian distribution model. This post was inspired by Andrew Ng’s Coursera <a href="https://www.coursera.org/learn/machine-learning/home/week/9">machine learning</a> course.</p>
<p>You can get the complete source code I used in implementing this algorithm from my repository <a href="https://github.com/udohsolomon/anomaly-detection/blob/master/anomaly_detection.py">here</a>.</p>Solomon AmosThis post is an overview of a simple anomaly detection algorithm implemented in Python. While there are different types of anomaly detection algorithms, we will focus on the univariate Gaussian and the multivariate Gaussian normal distribution algorithms in this post.Understanding Batch Normalization2017-06-21T00:00:00+00:002017-06-21T00:00:00+00:00https://udohsolomon.github.io/neural%20network/understanding-batch-normalization<p>I took Andrew’s <a href="https://www.coursera.org/learn/neural-networks-deep-learning">Deep Learning course</a> on Coursera; the course teaches you how to effectively design a neural network from scratch. It was extremely useful to my understanding of what is happening behind the scenes. The bottom-up approach of the course makes it really interesting, and the way Andrew breaks down some of these seemingly complex techniques and algorithms is a joy to watch: from batch normalisation to mini-batch gradient descent to hyperparameter tuning.</p>
<p>That being said, let us get down to the business at hand today: implementing batch normalization! In our previous post, where we looked at implementing a neural network from scratch, the first thing we did was to pre-process our data. That was normalization. We saw that normalizing the input features can speed up learning. What we did was compute the mean and the variances, and then normalize the data according to the variances.</p>
<script type="math/tex; mode=display">\varphi = \frac{1}{m}\sum_{i=1}^{m} x_{i}</script>
<script type="math/tex; mode=display">X = x_{i} - \varphi</script>
<script type="math/tex; mode=display">\sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}{(x_{i} - \varphi)}^{2}</script>
<script type="math/tex; mode=display">X = \frac{X} {\sigma^{2} }</script>
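<p>As a quick sketch, the equations above translate to the following NumPy code. Note that, following the post's convention, the centred data is scaled by the variance; the more common convention divides by the standard deviation instead. The sample matrix is made up:</p>

```python
import numpy as np

def normalize_inputs(X):
    """Zero-centre each feature, then scale by its variance (as in the equations above)."""
    mu = X.mean(axis=0)                      # per-feature mean
    X_centered = X - mu                      # subtract the mean
    sigma2 = (X_centered ** 2).mean(axis=0)  # per-feature variance
    return X_centered / sigma2               # scale by the variance

# Hypothetical inputs on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_norm = normalize_inputs(X)
```

<p>After this step each feature has zero mean, which is what turns the elongated cost contours into more rounded ones.</p>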
<p><img src="https://udohsolomon.github.io/assets/images/gradientdescent.png" alt="image-center" class="align-center" /></p>
<p>As we saw in the post, normalizing the input can turn the contours of our learning problem from a very elongated shape into something much more rounded, which makes it easier for our gradient descent algorithm to optimize.</p>
<p><img src="https://udohsolomon.github.io/assets/images/NNnetwork.png" alt="image-center" class="align-center" /></p>
<p>Now consider a deeper model where we have the input features and several hidden activation layers in our network, and we want to normalize the hidden layers, not just the input features of our model. It would be nice to normalize the means and variances of the activations of the previous layer. This will make the training of our parameters more efficient.</p>
<p>For any hidden layer, we can normalize the value of say <script type="math/tex">Z^{[3]}</script> in hidden layer <script type="math/tex">L^{4}</script> so as to train the values of <script type="math/tex">W^{[4]}, b^{[4]}</script> faster and make our model more efficient.</p>
<p>Consider our deep learning example from the perspective of a certain layer, say the 4th hidden layer, <script type="math/tex">L^{4}</script>, where our network has to learn the parameters <script type="math/tex">W^{[4]}, b^{[4]}</script>. From the perspective of this 4th hidden layer, each of the preceding layers has great influence over the inputs this layer will see. As you start to train your network, the distribution of what this layer sees will vary significantly over time. As an analogy, let us say you train your network on images of black cats only; if you try to apply this same network to a dataset with coloured cats, where the positive examples are not just black cats, then your classifier will perform poorly.
This situation, where the training dataset distribution is different from the test dataset distribution, is known as <script type="math/tex">\textbf {covariate shift}</script>. The idea is that if you’ve learned some <script type="math/tex">X</script> to <script type="math/tex">Y</script> mapping, <script type="math/tex">X \rightarrow Y</script>, and at any time the distribution of <script type="math/tex">X</script> changes, then you might need to retrain your learning algorithm.</p>
<p>So what batch norm does is reduce the amount that the distribution of the hidden units shifts around. No matter how the parameters of the previous layer change, the mean and variance of its outputs will remain the same. It limits the extent to which updating the parameters of the earlier layers can affect the distribution of values that the current layer sees and therefore has to learn on. This makes the values of the current layer more stable and gives the later layers firmer ground to stand on. Another interesting thing about batch norm is that it has a slight regularization effect; though this is not really the intent of batch norm, it sometimes has this effect on your learning algorithm.</p>
<h2 id="implementing-batch-normalization">Implementing Batch Normalization</h2>
<p>Given some intermediate values in a neural network, we can add batch norm by first feeding the input <script type="math/tex">X</script> into the first hidden layer <script type="math/tex">L^{1}</script> and then computing <script type="math/tex">Z^{[1]}</script>, governed by the parameters <script type="math/tex">W^{[1]}, b^{[1]}</script>. We then take <script type="math/tex">Z^{[1]}</script> and apply batch norm, governed by the parameters <script type="math/tex">{\beta^{[1]}}</script> and <script type="math/tex">{\gamma^{[1]}}</script>. This gives us the new normalized value <script type="math/tex">{\hat{Z}^{[1]}}</script>, which we then feed into the activation function to obtain <script type="math/tex">{a}^{[1]} = {g}^{[1]}{({\hat{Z}^{[1]}})}</script>.</p>
<script type="math/tex; mode=display">{X} \xrightarrow{W^{[1]}, b^{[1]}} {Z^{[1]}} \xrightarrow[Batch Norm (BN)]{\beta^{[1]}, \gamma^{[1]}} {\hat{Z}^{[1]}}\rightarrow{a}^{[1]} = {g}^{[1]}{({\hat{Z}^{[1]}})}</script>
<p>Now we’ve done the computation for the first hidden layer <script type="math/tex">L^{1}</script>. Next, we take the value <script type="math/tex">{a}^{[1]}</script> and use it to compute the batch norm for the next hidden layer and so on.</p>
<script type="math/tex; mode=display">{a}^{[1]} \xrightarrow{W^{[2]}, b^{[2]}} {Z^{[2]}} \xrightarrow[BN]{\beta^{[2]}, \gamma^{[2]}} {\hat{Z}^{[2]}}\rightarrow{a}^{[2]}</script>
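<p>A minimal NumPy sketch of this batch norm transform for one layer’s pre-activations follows. Here <code>gamma</code> and <code>beta</code> play the roles of <script type="math/tex">\gamma^{[l]}</script> and <script type="math/tex">\beta^{[l]}</script> above, and <code>eps</code> is a small constant for numerical stability; the mini-batch values are made up:</p>

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z across the mini-batch, then apply the learnable scale and shift."""
    mu = Z.mean(axis=0)                     # per-unit mean over the mini-batch
    var = Z.var(axis=0)                     # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * Z_norm + beta            # Z_hat, fed into the activation

# Hypothetical mini-batch: 4 examples, 3 hidden units
Z = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 5.0],
              [3.0, 3.0, 3.0]])
Z_hat = batch_norm_forward(Z, gamma=np.ones(3), beta=np.zeros(3))
```

<p>With <code>gamma</code> all ones and <code>beta</code> all zeros, each hidden unit in <code>Z_hat</code> has zero mean and (approximately) unit variance across the mini-batch; training then adjusts <code>gamma</code> and <code>beta</code> to whatever mean and variance suit that layer.</p>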
<p>With this new set of parameters in our algorithm, we can then use whatever optimization technique we want. So far, we’ve been talking about batch norm as if we were training on the entire training set at the same time, as with batch gradient descent. It is worth noting, however, that in practice batch normalization is usually applied with mini-batches of the training set.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for t = 1 ... num_mini_batches
  compute forward propagation on X{t}
  in each hidden layer, use BN to replace Z[l] with Z_hat[l]
  compute dW[l], dbeta[l], dgamma[l] using backward propagation
  update the parameters:
    W[l]     := W[l]     - alpha * dW[l]
    beta[l]  := beta[l]  - alpha * dbeta[l]
    gamma[l] := gamma[l] - alpha * dgamma[l]
end
</code></pre></div></div>
</code></pre></div></div>
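To make the BN step in the loop above concrete, here is a minimal NumPy sketch of the batch-norm forward computation. The function name, the example values, and the small constant `eps` (added for numerical stability) are my own; a full implementation would also keep running averages of the mean and variance for use at test time.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z has shape (n_units, m): one column per example in the mini-batch.
    mu = Z.mean(axis=1, keepdims=True)       # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)       # per-unit variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    Z_hat = gamma * Z_norm + beta            # learnable scale and shift
    return Z_hat

# Tiny made-up mini-batch: 2 hidden units, 3 examples.
Z = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
gamma = np.ones((2, 1))   # with gamma=1, beta=0 the output is just Z_norm
beta = np.zeros((2, 1))
Z_hat = batch_norm_forward(Z, gamma, beta)
```

With `gamma` and `beta` initialized to 1 and 0, each row of `Z_hat` has zero mean and (approximately) unit variance; during training the network learns values of `gamma` and `beta` that suit each layer.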
Solomon AmosI took Andrew’s Deep Learning course on Coursera; the course teaches you how to effectively design a neural network from scratch. It was extremely useful to my understanding of what is happening behind the scenes. The bottom-up approach of the course makes it really interesting, and the way Andrew breaks down some of these seemingly complex techniques and algorithms is a joy to watch: from batch normalisation to mini-batch gradient descent to hyperparameter tuning.Implementing neural network from scratch!2016-02-24T00:00:00+00:002016-02-24T00:00:00+00:00https://udohsolomon.github.io/neural%20network/implementing-neural-network-from-scratch<p>In this post we will try to understand how a neural network works by implementing it completely from scratch. You’ll get to understand what really goes on behind the scenes of this network. By the end of this post, you’ll not only understand the concept of a neural network, but you’ll also be able to implement one yourself completely from scratch. We are going to achieve this by looking at a practical example as a fun project.
I’m going to break these concepts down as much as possible; you don’t need to be a maths genius to understand them.</p>
<p>In a neural network, we have the input layer, the hidden layer and the output layer. The input layer consists of features known as the input features. In our example, they are represented as <script type="math/tex">{x_1}</script>, <script type="math/tex">{x_2}</script> and <script type="math/tex">{x_3}</script>, which are fed into the hidden layer. The hidden layer, on the other hand, consists of various nodes. It is termed the hidden layer because the true values of the nodes in the middle are not observed in the training set. In other words, we don’t see what they should be in the training set. The output layer is responsible for generating the predicted value <script type="math/tex">{\hat{y}}</script>.</p>
<p>We will introduce a concept called the <script type="math/tex">\textbf{activations}</script>. These are the values that the different layers of the network pass on to the subsequent layers. This means that the input layer passes on the value <script type="math/tex">X</script> to the hidden layer. This is called the <script type="math/tex">\textbf{activations}</script> of the input layer, denoted <script type="math/tex">a^{[0]}</script>. The hidden layer will in turn generate some activations, denoted <script type="math/tex">{a}^{[1]}</script>. So the first node in the hidden layer will generate a value <script type="math/tex">{a}_{1}^{[1]}</script>, the second node will generate the value <script type="math/tex">{a}_{2}^{[1]}</script>, and so on until we get to <script type="math/tex">{a}_{n}^{[1]}</script>, which indicates the last node of that hidden layer. Finally, the output layer will generate a value <script type="math/tex">{a}^{[2]}</script>, which is just a real number. So <script type="math/tex">{\hat{y}}</script>, our target output, takes the value of <script type="math/tex">{a}^{[2]}</script>: <script type="math/tex">{a}^{[2]} = {\hat{y}}</script>.</p>
<p>In neural network notation, the input layer is not usually counted. For example, the network described above is called a 2-layer neural network. The hidden and output layers are associated with some additional parameters, <script type="math/tex">W</script> and <script type="math/tex">b</script>. In our example <script type="math/tex">Z=WX + b</script>, where <script type="math/tex">W</script> is a matrix of weights and <script type="math/tex">b</script> is called the bias. Not to confuse you too much at this stage: we will delve into these parameters later in this post.
Now that you have a better understanding of the neural network representation, let us see what these different layers are computing. The first is <script type="math/tex">Z_1^{[1]} = W_1^{[1]T}{X}+b_1^{[1]}</script> and <script type="math/tex">a_1^{[1]} = {\sigma}(Z_1^{[1]})</script>. For both <script type="math/tex">Z</script> and <script type="math/tex">a</script>, the notational convention is <script type="math/tex">a_i^{[l]}</script>, where <script type="math/tex">[l]</script> denotes the layer number and <script type="math/tex">i</script> denotes the node in that layer.
For the neural network in our example, we have the following set of equations:</p>
<p><script type="math/tex">Z_1^{[1]} = W_1^{[1]T}{X}+b_1^{[1]}</script>, <script type="math/tex">a_1^{[1]} = {\sigma}(Z_1^{[1]})</script></p>
<p><script type="math/tex">Z_2^{[1]} = W_2^{[1]T}{X}+b_2^{[1]}</script>, <script type="math/tex">a_2^{[1]} = {\sigma}(Z_2^{[1]})</script></p>
<p><script type="math/tex">Z_3^{[1]} = W_3^{[1]T}{X}+b_3^{[1]}</script>, <script type="math/tex">a_3^{[1]} = {\sigma}(Z_3^{[1]})</script></p>
<p>We will start by taking the values of <script type="math/tex">Z</script> and vectorizing them. First, let us stack them into a matrix.</p>
<p><script type="math/tex">% <![CDATA[
\begin{bmatrix}
\dots & W_1^{[1]T} & \dots \\
\dots & W_2^{[1]T} & \dots \\
\dots & W_3^{[1]T} & \dots
\end{bmatrix} %]]></script>
<script type="math/tex">\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}</script>
+
<script type="math/tex">\begin{bmatrix}
b_1^{[1]}\\
b_2^{[1]}\\
b_3^{[1]}
\end{bmatrix}</script></p>
<p>We can then compute this matrix to have <script type="math/tex">Z^{[1]}</script> as follows</p>
<p><script type="math/tex">Z^{[1]} =
\begin{bmatrix}
W_1^{[1]T}X + b_1^{[1]}\\
W_2^{[1]T}X + b_2^{[1]}\\
W_3^{[1]T}X + b_3^{[1]}
\end{bmatrix}</script>
=
<script type="math/tex">\begin{bmatrix}
Z_1^{[1]}\\
Z_2^{[1]}\\
Z_3^{[1]}
\end{bmatrix}</script></p>
<p>Similarly, we take the elements <script type="math/tex">a_i^{[1]}</script> and stack them together to obtain our <script type="math/tex">a^{[1]}</script>:</p>
<p><script type="math/tex">a^{[1]} =
\begin{bmatrix}
a_1^{[1]}\\
a_2^{[1]}\\
a_3^{[1]}
\end{bmatrix}</script>
=
<script type="math/tex">{\sigma}(Z^{[1]})</script></p>
<p>where <script type="math/tex">{\sigma}</script> is the sigmoid function, which takes in the three elements of <script type="math/tex">Z^{[1]}</script> and applies the sigmoid elementwise.
From our example, it is worth noting that <script type="math/tex">W</script> is a (3×3) matrix, while <script type="math/tex">X</script> is a (3×1) matrix and <script type="math/tex">b</script> is a (3×1) matrix.</p>
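These shapes can be checked with a short NumPy sketch of the vectorized layer computation above. The random weights and the input values here are made up for illustration only.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid applied elementwise to an array.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 3))      # W: (3x3), one row per hidden node
b1 = np.zeros((3, 1))                 # b: (3x1)
x = np.array([[0.5], [-1.2], [2.0]])  # X: (3x1), the three input features

Z1 = W1 @ x + b1   # Z^[1] = W^[1] X + b^[1], shape (3, 1)
a1 = sigmoid(Z1)   # a^[1] = sigma(Z^[1])
```

Each row of `W1` plays the role of one transposed weight vector <script type="math/tex">W_i^{[1]T}</script>, so the single matrix product computes all three node values at once.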
<h2 id="vectorizing-across-multiple-examples">Vectorizing across multiple examples</h2>
<p>So far we have been considering a single training example, where for a given input we can predict the output <script type="math/tex">{\hat{y}}</script>. Most practical applications in machine learning involve a large number of training examples; for the purpose of this post, let us call it <script type="math/tex">m</script> training examples. For <script type="math/tex">m</script> training examples, we need to repeat the process of predicting the output <script type="math/tex">{\hat{y}}</script> for each one.</p>
<p>For a large number of training examples, we repeat these steps for each example as follows:</p>
<p>for i=1 to m:</p>
<p><script type="math/tex">Z^{[1](i)} = W^{[1]}{X^{(i)}}+b^{[1]}</script>, <script type="math/tex">a^{[1](i)} = {\sigma}(Z^{[1](i)})</script></p>
<p><script type="math/tex">Z^{[2](i)} = W^{[2]}{a^{[1](i)}}+b^{[2]}</script>, <script type="math/tex">a^{[2](i)} = {\sigma}(Z^{[2](i)})</script></p>
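In practice, the explicit loop over examples is itself vectorized away: the <script type="math/tex">m</script> input vectors are stacked as columns of one matrix, so a single matrix product computes the layer for every example at once. A minimal NumPy sketch, with made-up shapes and random values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 5                                # number of training examples
rng = np.random.default_rng(1)
X = rng.standard_normal((3, m))      # one column per example
W1 = rng.standard_normal((3, 3))
b1 = np.zeros((3, 1))

# One matrix product replaces the for-loop over examples;
# b1 broadcasts across the m columns.
Z1 = W1 @ X + b1                     # shape (3, m)
A1 = sigmoid(Z1)
```

Column <script type="math/tex">i</script> of `Z1` equals the per-example computation <script type="math/tex">W^{[1]}X^{(i)}+b^{[1]}</script>, which is exactly what the loop above produces one example at a time.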
<h2 id="activation-functions">Activation functions</h2>
<p>The activation function is a non-linear function that allows our network to compute complicated functions using only a small number of nodes. When working with a neural network, one of the choices you get to make is which activation function to use. The activation function is applied elementwise. The most common types are the sigmoid, tanh, ReLU, Leaky ReLU, and maxout. ReLU stands for Rectified Linear Unit; it has a very simple shape and is the most commonly used activation function.</p>
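For reference, here is a sketch of how the common activation functions mentioned above are typically implemented. Maxout is omitted because it requires its own learned weights; the 0.01 slope for Leaky ReLU is a common default, not something fixed by the definition.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real input into (-1, 1), centered at 0.
    return np.tanh(z)

def relu(z):
    # Passes positive values through, clamps negatives to 0.
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but lets a small gradient through for negative inputs.
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, 0.0, 3.0])
```

All four are applied elementwise, so they work unchanged on the vectorized <script type="math/tex">Z</script> matrices from the previous section.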
<h1 id="fun-example-predicting-car-prices">Fun example: Predicting car prices</h1>
<p>With the background and knowledge you have gained so far, let us see how we can apply a neural network to predicting the price of a car. We decided to predict the prices of cars since this is a practical example most people can relate to. For this example, we are going to consider only one model of car. The car we’re considering is the Toyota Camry, with the following features: the age of the car, the km travelled, and the fuel type; from these we are going to predict the price of the car. Due to the limited data we have and some important features being missing, our model will certainly not be perfect, since those missing features also impact the prices of cars. However, the idea is to use an example and a dataset most people can relate to, and to keep things very simple.</p>
<h2 id="data-exploration">Data exploration</h2>
<p>Before we start our predictive analysis using a neural network, let us first take a visual look at our data and try to understand how each of the features is distributed against the price. Here are the first 5 lines of the Toyota Camry file, which show us what the data actually looks like.</p>
<h2 id="data-normalization">Data normalization</h2>
<p>One of the most important steps in machine learning is called <script type="math/tex">{\textbf{normalization}}</script>, also known as feature scaling or, in simple terms, data preprocessing. Since our neural network model only works with numbers, the idea of feature scaling is to convert the data to roughly the same scale, and as small as possible. Normalizing the input features can speed up learning by making computation faster.</p>
<p>In our example, the km travelled are between 0 and 350km, the fuel type is binary (diesel/petrol), the age is between 0 and 40, and the price ranges between $500 and $40k.
We are going to normalize the km travelled and the age using the mean and variance in order to bring them to the same scale. Since the fuel type is binary, we’re going to transform it to values of -1 and +1. Since we are predicting the price and the output of our neural network is going to be between 0 and 1, it is convenient to normalize the price to the range [0, 1].
For both the km travelled and the age, the normalization equations are expressed as follows:</p>
<script type="math/tex; mode=display">\varphi = \frac{1}{m}\sum_{i=1}^{m} x_{i}</script>
<script type="math/tex; mode=display">X = x_{i} - \varphi</script>
<script type="math/tex; mode=display">\sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}{(x_{i} - \varphi)}^{2}</script>
<script type="math/tex; mode=display">X_{norm} = \frac{X}{\sigma}</script>
<p>The formula for the normalized car price is expressed as:</p>
<script type="math/tex; mode=display">\frac{x_i - min(x)}{max(x) - min(x)}</script>
<p>where <script type="math/tex">x_i</script> is the individual car price and <script type="math/tex">x</script> is the car prices in our dataset.</p>
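Putting the two schemes together, here is a sketch of both normalizations in NumPy. The feature values are invented purely for illustration, and the first scheme divides by the standard deviation, the common convention for bringing a feature to unit scale.

```python
import numpy as np

# Made-up values for one feature (km travelled) and the target (price).
km = np.array([40.0, 120.0, 5.0, 300.0, 75.0])
price = np.array([8000.0, 5000.0, 15000.0, 1500.0, 9500.0])

# Zero-mean, unit-scale normalization for a numeric input feature:
mean = km.mean()                 # (1/m) * sum of x_i
std = km.std()                   # square root of the variance
km_norm = (km - mean) / std

# Min-max scaling of the target price into [0, 1]:
price_norm = (price - price.min()) / (price.max() - price.min())
```

After this step `km_norm` has mean 0 and standard deviation 1, and every normalized price lies in [0, 1], matching the output range of the network.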
Solomon AmosIn this post we will try to understand how a neural network works by implementing it completely from scratch. You’ll get to understand what really goes on behind the scenes of this network. By the end of this post, you’ll not only understand the concept of a neural network, but you’ll also be able to implement one yourself completely from scratch. We are going to achieve this by looking at a practical example as a fun project. I’m going to break these concepts down as much as possible; you don’t need to be a maths genius to understand them.