As capital markets evolve, data science is playing a growing role in scaling their operations. In this article, I will talk about how data science is currently used.


More businesses today are riding the Big Data analytics bandwagon with the objective of converting insights, gleaned from huge piles of data, into genuine business advantage. In the retail banking space, unstructured data collected from a broad range of social media sources has enabled advanced customer profiling and in-depth analytics that, in turn, are helping enhance customer loyalty and experience. In capital markets, however, firms have traditionally dealt with structured data sets from limited, pre-defined sources. Big data strategies have now begun to impact a select few areas in capital markets firms over the…


Time is not a line, but a series of now-points. Taisen Deshimaru

By the end of this article, you will be able to:

  • Describe time series analysis, its components, and when to use it
  • Perform exploratory data analysis on time-series data
  • Perform time series decomposition

Time Series

What is Time Series Analysis

It is a statistical technique that deals with time-series data, or trend analysis.

In a time series, time is often the independent variable, and the goal is usually to forecast future values.

  • Predicting future values of a variable based on past data
  • Building forecasting models
  • Most common example: stock market prediction, which is very challenging because many factors are intertwined. …
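As a rough illustration of exploring and decomposing a time series, here is a minimal sketch in base R. It assumes the built-in `AirPassengers` data set (chosen here for convenience, not taken from the article) and a multiplicative decomposition, which is only one reasonable choice.

```r
# AirPassengers ships with base R (datasets package): monthly totals, 1949-1960.
data(AirPassengers)

# Quick exploratory look at the series.
frequency(AirPassengers)   # observations per year
summary(AirPassengers)

# Classical decomposition into trend, seasonal, and random components.
# type = "multiplicative" is an assumption made here because the seasonal
# swings grow with the level of the series.
parts <- decompose(AirPassengers, type = "multiplicative")

plot(parts)  # panels: observed, trend, seasonal, random
```

The `trend`, `seasonal`, and `random` elements of `parts` can then be inspected separately before building a forecasting model.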


Clustering

K-means Clustering

… Does this look familiar?

It kinda looks like slicing up variance AKA sums-of-squares.

Examples in R

www.r-bloggers.com/exploring-assumptions-of-k-means-clustering-using-r/

www.r-bloggers.com/k-means-clustering-is-not-a-free-lunch/

Assumptions:

  • clusters are spherical
  • clusters are similar sizes

Reporting:

  • We ran a k-means clustering using 2 clusters. Two clusters were chosen by the “elbow method”.
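The elbow method mentioned above can be sketched as follows. This is a minimal example on simulated data (the points, seed, and range of k are all assumptions for illustration), using only `kmeans()` from base R.

```r
set.seed(42)  # kmeans() uses random starts, so fix the seed for repeatability

# Hypothetical data: two roughly spherical, similar-sized groups in 2-D.
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(pts, centers = k, nstart = 20)$tot.withinss)

# The "elbow" is where adding clusters stops reducing the sum of squares by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Fit the chosen solution and check the cluster sizes.
fit <- kmeans(pts, centers = 2, nstart = 20)
table(fit$cluster)
```

Because the simulated groups match the assumptions listed above (spherical, similar sizes), the drop in sum of squares from one to two clusters should dominate the plot.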

Choosing a model


By the end of this article, you should be able to plan, apply, diagnose and report logistic regression models using R.

Scenario

A man was admitted into a hospital for stomach pain. An x-ray revealed a shadow of an eel in his stomach (he had neglected to mention that he had inserted an eel into his digestive tract in order to cure constipation). So the question is: can the treatment of inserting an eel into one’s digestive tract cure constipation?

Read in Data

Check Variables


Black swan points?

Outliers and influential points

Outliers and influential points can have a serious impact on your model but, as we’ve discussed, the definition of an “outlier” depends on the context. When looking at data overall, you will likely want to check for errors in data entry, etc. But when working with a specific model, you may have justification to remove some data to create a more accurate model about the typical cases…at the cost of some generalizability.

With regression models, we have a couple of ways to figure this out.

Outliers

With regression models, residuals that are far away from 0 have increased influence when using…
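One possible way to flag such cases in R is sketched below. It assumes the built-in `mtcars` data and an arbitrary model, and the cut-offs (2 for standardized residuals, 4/n for Cook's distance) are common rules of thumb rather than fixed thresholds.

```r
# mtcars ships with base R; the model is only for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Outliers: standardized residuals far from 0 (|value| > 2 is a common flag).
std_res <- rstandard(fit)
outliers <- names(std_res)[abs(std_res) > 2]

# Influential points: Cook's distance, with 4 / n as one rule of thumb.
cooks <- cooks.distance(fit)
influential <- names(cooks)[cooks > 4 / nrow(mtcars)]

outliers
influential
```

A flagged case is a prompt to look at the row, not an automatic reason to delete it.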


What if you have confounding variables? Are two variables correlated only because of a third variable, and how do we account for that?

Testing for Exam Anxiety

Scenario

A psychologist was interested in the effects of exam stress and revision (a.k.a. studying) time on exam performance.

She devised and validated a questionnaire to assess state anxiety relating to exams — the Exam Anxiety Questionnaire (EAQ), which produces a measure of anxiety out of 100.

Anxiety was measured before an exam, and each student’s exam score and the number of hours spent revising (studying) were also recorded. (from DSUR 4.5 and 6.5)

What do we do first?

Import the data

Let’s get the data

Testing correlation

You can use `cor.test()` to test whether a correlation is significantly different from zero.
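A minimal sketch of a `cor.test()` call follows. The data are simulated stand-ins (the variable names `revise` and `anxiety` and all values are assumptions, not the study's data), so only the shape of the call and its output matters.

```r
set.seed(1)

# Hypothetical data standing in for the exam anxiety study: these values are
# simulated here and are NOT the data from the original scenario.
n <- 100
revise  <- rexp(n, rate = 1 / 20)                 # hours spent revising
anxiety <- 90 - 0.6 * revise + rnorm(n, sd = 8)   # anxiety score

res <- cor.test(revise, anxiety, method = "pearson")

res$estimate   # Pearson's r
res$p.value    # p-value for H0: the true correlation is 0
res$conf.int   # 95% confidence interval (reported for Pearson only)
```

Setting `method = "spearman"` or `method = "kendall"` in the same call gives the rank-based alternatives.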


Getting more out of what you were given or currently have

By the end of this article, you will be able to:

  • describe bootstrapping as a technique and explain how confidence intervals are computed using bootstrapping
  • calculate bootstrapped correlations and confidence intervals using R
  • define a function in R
  • calculate, interpret, and report partial correlation using R
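To make the bootstrapping objectives concrete, here is a minimal base R sketch. It assumes simulated data and a simple percentile interval. The helper `boot_cor()` is a hypothetical function written for this example, not something from the article.

```r
set.seed(123)

# Hypothetical data: simulated, not the data set from the article.
n <- 60
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# A user-defined function: resample cases with replacement, recompute r.
boot_cor <- function(x, y) {
  idx <- sample(seq_along(x), replace = TRUE)
  cor(x[idx], y[idx])
}

# Repeat many times to approximate the sampling distribution of r.
boot_r <- replicate(2000, boot_cor(x, y))

# Percentile bootstrap 95% confidence interval.
ci <- quantile(boot_r, probs = c(0.025, 0.975))
ci
```

The `boot` package (`boot()` and `boot.ci()`) offers other interval types. The percentile interval is used here only because it is the simplest to write by hand.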

Non-parametric tests

Another data set

Last time, we saw another data set — a measure of how creative people are and their position (i.e., rank) in a “greatest liar” competition.


Does not mean causation

By the end of this article, you will be able to:

  • suggest general ways to deal with missing data
  • explain correlation and its relationship to causality
  • calculate correlation using Pearson’s r, Spearman’s rho, Kendall’s tau, and test them for significance
  • calculate Pearson’s r confidence intervals
  • check assumptions of Pearson’s r and suggest which correlation measure to use

Missing Data

Scenario

A biologist was worried about the potential health effects of music festivals.

She went to the Download Music Festival and measured the hygiene of 810 concert-goers over the three days of the festival.

Hygiene was measured using a standardized technique that yields a score…


Testing distribution of your data

Shapiro-Wilk normality test

The Shapiro-Wilk test tends to be quite powerful, so with large samples the test can be significant even when the scores differ only slightly from a normal distribution.

→ Use histograms, Q-Q plots, and values of skewness/kurtosis to double-check.

The Shapiro-Wilk test found that the hygiene scores on Day 1 were significantly non-normal at the 5% level of significance (W=0.99591, p<0.05).

However, inspection of Q-Q plots, skewness, and kurtosis suggested that the data follow a normal distribution; we continued the analysis under the assumption of normality.

So when testing for normality, you should look at Q-Q plots, histogram…
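The checks described above might look like this in base R. The scores are simulated (an assumption, not the festival hygiene data), and skewness and kurtosis are computed by hand as simple moment-based estimates rather than taken from a package.

```r
set.seed(7)

# Hypothetical scores: simulated, not the festival hygiene data.
scores <- rnorm(800, mean = 1.8, sd = 0.7)

# Shapiro-Wilk test (accepts between 3 and 5000 observations).
sw <- shapiro.test(scores)
sw$statistic   # W
sw$p.value

# Visual checks.
hist(scores, breaks = 30)
qqnorm(scores)
qqline(scores)

# Moment-based skewness and excess kurtosis (both near 0 for normal data).
z <- (scores - mean(scores)) / sd(scores)
skewness <- mean(z^3)
kurtosis <- mean(z^4) - 3
c(skewness = skewness, kurtosis = kurtosis)
```

Reading the test result alongside the plots and the two moment values is what guards against over-interpreting a significant W in a large sample.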


Comparing two models

Hierarchical methods

In hierarchical regression, we first put in all predictors that have an a priori reason to be there. This reason can come from previous work, the problem statement, or the experimenter’s own hypotheses and questions. Then, we add other possible predictors to the base model. Additional predictors can be added all at once, using other methods (like stepwise or all subsets), or using additional theoretical reasons.
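A minimal sketch of this two-step approach, and of comparing the resulting models, is shown here. It assumes the built-in `mtcars` data, and treating `wt` as the a priori predictor is purely an illustrative choice.

```r
# mtcars ships with base R; treating wt as the "a priori" predictor is an
# assumption made purely for illustration.
base_model <- lm(mpg ~ wt, data = mtcars)

# Second block: add further candidate predictors to the base model.
full_model <- update(base_model, . ~ . + hp + qsec)

# F-test for whether the added predictors improve the fit.
cmp <- anova(base_model, full_model)
cmp

# Change in R-squared between the two steps.
summary(full_model)$r.squared - summary(base_model)$r.squared
```

`anova()` is appropriate here because the base model is nested inside the larger one.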

Forced-entry methods

In forced-entry methods, we start with all predictors in the model simultaneously. This means there is no order in adding predictors into the model. …

Anh Le

Co-founder of Blossom Research Capital | Data Scientist at TMX Group | Chess Player
