More businesses today are riding the Big Data analytics bandwagon with the objective of converting insights, gleaned from huge piles of data, into genuine business advantage. In the retail banking space, unstructured data collected from a broad range of social media sources has resulted in advanced customer profiling and in-depth analytics that in turn are helping enhance customer loyalty and experiences. In capital markets, however, firms have traditionally dealt with structured data sets from limited and pre-defined sources. Big Data strategies have now begun to impact a select few areas in capital markets firms over the…

By the end of this article, you will be able to:

- Describe time series analysis, its components, and when to use it
- Perform exploratory data analysis on time series data
- Perform time series decomposition

What is Time Series Analysis?

Time series analysis is a statistical technique for working with time series data, including trend analysis.

In a time series, time is often the independent variable, and the goal is usually to forecast future values:

- Predicting future values of a variable based on past data
- Building forecasting models
- The most common example is stock market prediction, which is very challenging because there are many intertwining factors. …
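As a rough illustration of forecasting from past values, here is a minimal Python sketch (not code from the article, which works in R). It smooths a series with a trailing moving average and uses the mean of the most recent observations as a naive one-step forecast. The `sales` numbers and the window of 3 are made up for the example.

```python
def moving_average(series, window):
    """Trailing moving average: the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def naive_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window


# Hypothetical, steadily rising series.
sales = [10, 12, 14, 16, 18, 20]
trend = moving_average(sales, 3)      # smoothed values, a crude trend estimate
forecast = naive_forecast(sales, 3)   # naive forecast for the next period
```

A moving average lags behind a rising series, so this forecast undershoots here. That is one reason real forecasting models treat trend and seasonality explicitly.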

… Does this look familiar?

It kinda looks like slicing up variance AKA sums-of-squares.

www.r-blogger.com/exploring-assumptions-of-k-means-clusterin-using-r/

www.r-bloggers.com/k-means-clustering-is-not-a-free-lunch/

- clusters are spherical
- clusters are similar sizes

- We ran a k-means clustering using 2 clusters. Two clusters were chosen by the “elbow method”.
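The "slicing up variance" idea and the elbow method can be sketched in a few lines of Python (an illustration only; the article's own analysis is in R). The code runs Lloyd's algorithm on one-dimensional data and reports the total within-cluster sum of squares (WCSS) for several values of k. The data, the deterministic starting centers, and the function name `kmeans_1d` are all assumptions made for this example.

```python
def kmeans_1d(points, k, iterations=100):
    """Lloyd's algorithm on 1-D data; returns (centers, within-cluster sum of squares)."""
    data = sorted(points)
    # Assumption: start from evenly spaced positions in the sorted data
    # (real implementations usually use random or k-means++ starts).
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min((x - c) ** 2 for c in centers) for x in data)
    return centers, wcss


# Hypothetical data with two obvious groups.
data = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
wcss_by_k = {k: kmeans_1d(data, k)[1] for k in (1, 2, 3)}
```

The "elbow" is the value of k after which adding clusters barely reduces WCSS. For this toy data the drop from k = 1 to k = 2 is far larger than the drop from k = 2 to k = 3.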

By the end of this article, you should be able to plan, apply, diagnose and report logistic regression models using R.

A man was admitted into a hospital for stomach pain. An x-ray revealed a shadow of an eel in his stomach (he had neglected to mention that he had inserted an eel into his digestive tract in order to cure constipation). So the question is: can the treatment of inserting an eel into one’s digestive tract cure constipation?

Outliers and influential points can have a serious impact on your model but, as we’ve discussed, the definition of an “outlier” depends on the context. When looking at data overall, you will likely want to check for errors in data entry, etc. But when working with a specific model, you may have justification to remove some data to create a more accurate model about the typical cases…at the cost of some generalizability.

With regression models, we have a couple of ways to figure this out

With regression models, residuals that are far away from 0 have increased influence when using…
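One way to see which cases sit far from the fitted line is to compute residuals directly. The Python sketch below (an illustration, not the article's R code) fits a simple least-squares line and divides each residual by the residual standard error. The data are hypothetical, and this simple scaling does not adjust for leverage the way studentized residuals do.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def standardized_residuals(x, y):
    """Residuals divided by the residual standard error (n - 2 degrees of freedom)."""
    intercept, slope = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sigma = math.sqrt(sum(r ** 2 for r in residuals) / (len(x) - 2))
    return [r / sigma for r in residuals]


# Hypothetical data: the last point sits well above the trend of the others.
hours = [1, 2, 3, 4, 5, 6]
score = [2, 4, 6, 8, 10, 30]
z = standardized_residuals(hours, score)
most_extreme = max(range(len(z)), key=lambda i: abs(z[i]))  # index of the largest |residual|
```

The unusual point also pulls the fitted line toward itself, so its own residual looks smaller than it would against a line fitted without it.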

A psychologist was interested in the effects of exam stress and revision (i.e., studying) time on exam performance.

She devised and validated a questionnaire to assess state anxiety relating to exams — the Exam Anxiety Questionnaire (EAQ), which produces a measure of anxiety out of 100.

Anxiety was measured before an exam. Each student's exam score and the number of hours spent revising (studying) were also recorded. (From DSUR 4.5 and 6.5.)

**What do we do first?**

Let’s get the data

In R, you can use `cor.test()` to test whether a correlation differs significantly from zero.
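To show what sits behind a Pearson correlation test, here is a minimal Python sketch (an illustration alongside the R workflow, not a replacement for `cor.test()`). It computes Pearson's r and the t statistic with n - 2 degrees of freedom used to test it against zero. The `anxiety` and `exam` values are invented, not the DSUR data, and the p-value lookup is left out.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical scores for five students.
anxiety = [60, 70, 80, 90, 75]
exam = [80, 65, 50, 40, 60]
r = pearson_r(anxiety, exam)
t = correlation_t_statistic(r, len(anxiety))
```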

By the end of this article, you will be able to:

- describe bootstrapping as a technique and explain how confidence intervals are computed using bootstrapping
- calculate bootstrapped correlations and confidence intervals using R

- define a function in R
- calculate, interpret, and report partial correlation using R
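A percentile bootstrap for a correlation can be sketched as follows, in Python for illustration (the article itself uses R). Pairs of observations are resampled with replacement, the correlation is recomputed for each resample, and the confidence interval is read off the sorted estimates. The function names, the default of 2000 resamples, the fixed seed, and the data are assumptions for this example.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlation_ci(x, y, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(x)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # correlation is undefined when a resample has no variance
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    alpha = 1 - level
    lower = estimates[round(alpha / 2 * n_resamples)]
    upper = estimates[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical paired measurements.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
ci = bootstrap_correlation_ci(x, y)
```

The percentile interval is the simplest bootstrap interval. Bias-corrected variants exist and may be preferable for small samples.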

Non-parametric tests

Last time, we saw another data set — a measure of how creative people are and their position (i.e., rank) in a “greatest liar” competition.

By the end of this article, you will be able to:

- suggest general ways to deal with missing data
- explain correlation and its relationship to causality
- calculate correlation using Pearson’s r, Spearman’s rho, Kendall’s tau, and test them for significance
- calculate Pearson’s r confidence intervals
- check assumptions of Pearson’s r and suggest which correlation measure to use
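The two rank-based measures named above can be sketched in Python (an illustration only; the article computes them in R). Spearman's rho is Pearson's r applied to ranks, and Kendall's tau compares concordant and discordant pairs. The tau function below is the simple tau-a form and assumes no tied values, and the data are made up.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical scores with no ties.
creativity = [1, 2, 3, 4, 5]
position = [2, 1, 4, 3, 5]
rho = spearman_rho(creativity, position)
tau = kendall_tau_a(creativity, position)
```

With tied ranks, the tau-b variant, which adjusts the denominator for ties, is the usual choice.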

A biologist was worried about the potential health effects of music festivals.

She went to the Download Music Festival and measured the hygiene of 810 concert-goers over the three days of the festival.

Hygiene was measured using a standardized technique that yields a score…

The Shapiro-Wilk test tends to be quite powerful. So with large samples, the test can be significant even when the scores are only slightly different from a normal distribution.

→ Use histograms, Q-Q plots, and values of skewness/kurtosis to double-check.
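Skewness and kurtosis can be computed directly from the sample moments. The Python sketch below is for illustration (the article works in R) and uses the plain moment-based definitions, so R packages that apply small-sample corrections may report slightly different values. The sample data are made up.

```python
def central_moment(values, order):
    """Mean of (value - mean) ** order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
skew_sym = skewness(symmetric)
kurt_sym = excess_kurtosis(symmetric)
skew_right = skewness(right_skewed)
```

Values near zero for both statistics are consistent with normality, while a clearly positive skewness points to a long right tail.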

The Shapiro-Wilk test found that the hygiene scores on Day 1 were significantly non-normal at the 5% level of significance (W=0.99591, p<0.05).

**However,** inspection of Q-Q plots, skewness, and kurtosis suggested that the data follow a normal distribution; we continued analysis with the assumption of normality.

So when testing for normality, you should look at Q-Q plots, histogram…

In hierarchical regression, we first put in all predictors that have an a priori reason to be there. This can come from previous work, the problem statement, or the experimenter’s own hypotheses and questions. Then, we add other possible predictors to the base model. Additional predictors can be added all at once, using other methods (like stepwise or all subsets), or using additional theoretical reasons.

In forced-entry methods, we start with all predictors in the model simultaneously. This means there is no order in adding predictors into the model. …

Co-founder of Blossom Research Capital | Data Scientist at TMX Group | Chess Player