Data Science | 𝚃𝚛𝚊𝚗𝚜𝚙𝚘𝚗𝚜𝚝𝚎𝚛

Data Dashboard for StockX Contest

Wed, 21 Aug 2019 21:13:14 -0500

StockX Data Contest 2019

The StockX Data Contest is a call for data and sneaker nerds to have fun.

Source: StockX

The basic idea is this: they give you a bunch of original StockX sneaker data, then you crunch the numbers and come up with the coolest, smartest, most compelling story you can tell. It can be literally anything you want. A theory, an insight, even just a really original data visualization. It could be a novel hypothesis about resale prices you’ve always wanted to test. Or maybe it’s just a beautiful chart to visualize the data. It can be on any subject – sneakers, brands, buyers, or even StockX itself. Whatever you find interesting, just follow your bliss.

I also took a shot at coming up with something useful. Below is my finished data dashboard.

My Data Dashboard for StockX

Dashboard

The link to the Tableau worksheet is here

Calculations on the Dashboards

Price ratio: the ratio of sale price to retail price for each sneaker.
Weeks: (order date - release date), converted to weeks.
Median price ratio: the median is used to reduce the effect of uneven date coverage (2017 and 2019 are not as complete as 2018) and of differing sales counts per sneaker.
Color scale: the colors for the two brands are consistent across every plot that involves brands.
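The dashboard itself was built in Tableau, but the same calculations can be sketched in a few lines of base R. This is only an illustration: the rows are made up, and the column names (brand, order_date, release_date, sale_price, retail_price) are assumptions, not the actual schema of the contest file.

```r
# Hypothetical sample rows; column names and values are assumptions for illustration.
sales <- data.frame(
  brand        = c("Yeezy", "Yeezy", "Off-White", "Off-White"),
  order_date   = as.Date(c("2018-01-15", "2018-03-01", "2018-02-10", "2018-06-20")),
  release_date = as.Date(c("2017-12-16", "2017-12-16", "2018-02-03", "2018-02-03")),
  sale_price   = c(440, 400, 900, 1100),
  retail_price = c(220, 220, 170, 170)
)

# Price ratio: sale price divided by retail price.
sales$price_ratio <- sales$sale_price / sales$retail_price

# Weeks: (order date - release date), converted from days to weeks.
sales$weeks <- as.numeric(difftime(sales$order_date, sales$release_date, units = "days")) / 7

# Median price ratio per brand; the median is less sensitive to a few extreme sales than the mean.
median_ratio <- aggregate(price_ratio ~ brand, data = sales, FUN = median)
print(median_ratio)
```

A negative value in the weeks column would correspond to an order placed before the release date.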

1) Orders of Sneakers by Brand, by Weeks from Release Date

This plot shows the total count of orders for different sneakers of the two brands. Both brands receive orders before the release date. Off-White has more orders than Yeezy in the dataset.

It’s interesting how the demand for Yeezy increased at around `90 weeks` after the release of the shoes.

2) Ratio of Sale Price to Retail Price for Each Brand by Weeks

This plot looks at the relationship between the ratio of sale price to retail price for each brand and the number of weeks after the release date. Clearly, both brands sell for more than the retail price. The Off-White ratio generally increases regardless of the individual sneaker, while the Yeezy ratio is somewhat noisy but follows a similar trend. For both brands, the price ratio increases after the release date.

3) Distribution of Median Sale Price Given the Retail Price for Each Brand

This plot looks in detail at how the median sale price is distributed for each sneaker. It shows the distribution of the median sale price for the top 28 sneakers that sold for at least 5 times the retail price.

4) Median Price Ratio and States

This plot looks at the median price ratio for all the states. The color scale encodes the ratio, and the size of each sneaker marker shows total sales relative to the other states. Which states usually pay more for sneakers? Clearly, Delaware, Vermont and Utah had some sales with a high price ratio. States like California and New York have a lot of sales, as shown by their relative sizes. The relative size is calculated by taking the log of total sales in each state. States like Wyoming have fewer sales and also a lower price ratio.
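As a rough sketch of the two encodings on the map, here is how the per-state color and size values could be computed in base R. The rows are invented, "total sales" is taken here to mean the number of sales (an assumption), and adding 1 before the log is also my own choice so that a state with a single sale still gets a non-zero size.

```r
# Hypothetical per-sale rows; states and ratios are illustrative only.
sales <- data.frame(
  state       = c("California", "California", "California", "Wyoming"),
  price_ratio = c(2.1, 1.8, 2.4, 1.5)
)

# Color: median price ratio per state.
by_state <- aggregate(price_ratio ~ state, data = sales, FUN = median)
names(by_state)[2] <- "median_ratio"

# Size: log of the number of sales per state (the +1 is an assumption, see above).
counts <- table(sales$state)
by_state$size <- log(as.numeric(counts[as.character(by_state$state)]) + 1)
print(by_state)
```

The log keeps states with very many sales from dwarfing every other marker on the map.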

Winona Area Public Schools: Community Contribution

Wed, 21 Aug 2019 21:13:14 -0500

Winona Area Public Schools Data Visualization

Introduction:
This project addresses the need to communicate public school data to community members in a meaningful way, and to make the data available to the general public in a proper, usable format.

There has been a wider discussion regarding the budget issue in Winona area schools. Here is the article

Primarily, this project focused on cleaning and visualizing the Enrollment, Expenditures and Staffing History reports of the Winona Area Public Schools (WAPS) district, which are publicly available through the Minnesota Department of Education Data Center. Link: http://education.state.mn.us/MDE/Data/

Methods and Steps of Projects

1) Data Inspection/Acquisition
The public data was collected by Alison Quam (representative from the WAPS district). The data was made available in different PDF/Excel files, and the information was scattered across those files.

2) Data Cleaning and Formatting
First, most of the PDF files were converted to Excel with Tabula (link: http://tabula.technology/) and an online tool (http://pdftoexcel.com). Then they were cleaned up into a proper format and stacked using Python (Pandas).
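The original stacking was done in Python with Pandas and that code is not reproduced here. To give a feel for the step, below is an R analogue (not the project's code): two invented yearly extracts with inconsistent headers are normalized and stacked into one long table. The data frames, school labels and the clean_names helper are all hypothetical.

```r
# Two hypothetical yearly extracts with inconsistent headers, standing in for
# the spreadsheets produced from the PDFs.
fy2017 <- data.frame(School = c("A", "B"), `Total Enrollment` = c(310, 275),
                     check.names = FALSE)
fy2018 <- data.frame(school = c("A", "B"), total_enrollment = c(298, 281))

# Normalize headers: lower-case, with runs of non-alphanumerics turned into "_".
clean_names <- function(df) {
  names(df) <- gsub("[^a-z0-9]+", "_", tolower(names(df)))
  df
}

# Tag each extract with its year, then stack them into one long table.
extracts <- list(`2017` = fy2017, `2018` = fy2018)
stacked <- do.call(rbind, lapply(names(extracts), function(year) {
  df <- clean_names(extracts[[year]])
  df$year <- as.integer(year)
  df
}))
print(stacked)
```

A long table with one row per school and year is the shape Tableau handles most comfortably, which is presumably why the files were stacked rather than joined side by side.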

3) Data Exploration and Visualization
This part of the project focused on addressing the questions provided by the WAPS representative (Alison Quam). Tableau was used extensively to explore and visualize the data. Primarily, I focused on answering the following questions.
1. I was curious about how enrollment and the capture rate (the rate of newborns enrolling in kindergarten) are changing in the WAPS district.

After a few meetings with the representative, I realized she was more curious about how the schools spend across different programs.

2. How are the expenditure per average daily membership (the count of students served daily in schools) and the spending in various categories changing?
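For the second question, the core quantity is simply total expenditure divided by average daily membership (ADM), tracked over time. A minimal sketch in R, with figures that are made up for illustration and are not WAPS numbers:

```r
# Hypothetical figures for illustration only; these are not WAPS numbers.
finance <- data.frame(
  year        = c(2016, 2017, 2018),
  expenditure = c(30e6, 31e6, 31.5e6),  # total spending in dollars
  adm         = c(3000, 2900, 2800)     # average daily membership
)

# Expenditure per ADM: total spending divided by average daily membership.
finance$exp_per_adm <- finance$expenditure / finance$adm

# Year-over-year percent change; the first year has no predecessor, hence NA.
finance$pct_change <- c(NA, diff(finance$exp_per_adm) / head(finance$exp_per_adm, -1) * 100)
print(finance)
```

Note that spending per ADM can rise even when total spending is nearly flat, simply because membership is falling.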

The link to the Tableau file and the data is here

Now, the visual story begins…

This project actually helped inform decision makers at the local level. Thus, I was able to contribute to something meaningful with my Python and Tableau skills.

Acknowledgement

I would like to thank the WAPS representative and Prof. Silas Bergen for helping and guiding me to understand the terms and calculations already done in the reports, and Prof. Todd Iverson for helping me figure out the Python code for cleaning the data.

Animation:Internet Usage

Thu, 15 Aug 2019 21:13:14 -0500

How is the internet eating the world? Internet usage animation

Internet usage is a World Bank development indicator. In this project I grabbed the World Bank dataset (which is in the link provided below).

Link to the tableau worksheet

Testing Alcohol level

Wed, 23 May 2018 21:13:14 -0500

Is there really 5.4% alcohol in that beer brand?

We all see that a lot of brands print on their label that the alcohol level is 5.4%. Let’s say we collected the alcohol-by-volume percentage for those brands: we sampled randomly and measured the alcohol level ourselves.

So we believe that the actual alcohol percentage should be 5.4%, but as beer consumers we sometimes feel it’s not.

If we measured one beer and found 6.7%, we would immediately complain that the brand is lying to us about the 5.4%. They may argue that our measuring apparatus or technique is not 100% accurate. There is no way of finding out how inaccurate our measurement is without measuring multiple times or measuring multiple beers. It might be that our measurement is 100% accurate and the beer has more alcohol than the company says. We don’t really know. Also, we can’t measure every single beer they ever manufactured. This is the perfect time to put our statistical sense to the test. Below we have a list of measurements from different beers bought at random, some from Midtown, some from Walmart. Let’s do a t-test.

level <- c(5.1, 5.2, 6, 7, 5.01, 5.0, 6.5, 5.6, 5.2, 6.1, 6.2, 5.0)
t.test(level, mu = 5.4)

## 
##  One Sample t-test
## 
## data:  level
## t = 1.3139, df = 11, p-value = 0.2156
## alternative hypothesis: true mean is not equal to 5.4
## 95 percent confidence interval:
##  5.225010 6.093323
## sample estimates:
## mean of x 
##  5.659167

The p-value (0.2156) is greater than 0.05, and the 95% confidence interval is [5.23, 6.09], which contains 5.4. So we fail to reject the claim that the mean alcohol level is 5.4%. As for the interval itself: if 100 people each did this random sampling of beer and calculated a confidence interval, we would expect about 95 of those intervals to contain the true mean.
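To see where the numbers in the t.test output come from, the same quantities can be computed by hand. This sketch uses only base R and the measurements listed above; the variable names (mu0, se, t_stat, p_value, ci) are my own.

```r
level <- c(5.1, 5.2, 6, 7, 5.01, 5.0, 6.5, 5.6, 5.2, 6.1, 6.2, 5.0)
mu0 <- 5.4                                   # claimed alcohol level
n <- length(level)

se <- sd(level) / sqrt(n)                    # standard error of the mean
t_stat <- (mean(level) - mu0) / se           # distance from 5.4 in standard errors
p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value
ci <- mean(level) + c(-1, 1) * qt(0.975, df = n - 1) * se  # 95% confidence interval

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

The rounded values should agree with the t, p-value and confidence interval printed by t.test above, since t.test performs the same calculation for a one-sample test.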

Enough with the statistical jargon? Okay, let’s enjoy the beer.

Verifying empirical rule and Chebyshev's theorem

Mon, 01 Jan 0001 00:00:00 +0000

Empirical rule and Chebyshev’s theorem

Let’s talk about a really simple but powerful concept: data distributions. A data distribution is an abstract concept (a function) that gives the possible values of the data and how often each value is generated. When you want to talk about all the data from your experiment at once, you talk about its distribution. A data distribution gives us the probability of how often a value will be the output if we keep repeating the experiment.

We rarely have the complete dataset from an experiment. So it is powerful to have an idea of how the data is distributed and which values occur more often than others. We can intuitively understand some distributions, like the height of a population. We know there will be a few really short people and a few really tall ones, but we are sure that most people will be in between. It is really convenient to know the spread and frequency of the data in advance.

The interesting thing is that there is more than one kind of distribution in the world, so it helps to know something about the spread of the data in advance whatever the distribution is. There is a famous theorem that gives us an idea of how our data is distributed. It’s called Chebyshev’s theorem.

image credit: libretext

It says that at least 3/4 of our data will lie within two standard deviations of the mean, whatever the distribution.
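Stated a bit more generally, Chebyshev’s theorem says that for any distribution with a finite mean and standard deviation, at least 1 - 1/k^2 of the data lies within k standard deviations of the mean (for k > 1). A small sketch in R puts that guarantee next to the exact share for a normal distribution; chebyshev_bound is just a helper name made up for this illustration.

```r
# Chebyshev's lower bound on the share of data within k standard deviations (k > 1).
chebyshev_bound <- function(k) 1 - 1 / k^2

k <- c(2, 3)
bounds <- data.frame(
  k = k,
  chebyshev_at_least = chebyshev_bound(k),  # holds for any distribution
  normal_exact = pnorm(k) - pnorm(-k)       # holds only for a normal distribution
)
print(bounds)
```

The Chebyshev column is a floor that works for every distribution, which is why it is lower than the normal-specific share.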

library(tidyverse)
library(knitr)
library(kableExtra)
stock<- read.csv("~/OneDrive - MNSCU/FALL 2019/MathStat/Data/Stock Trade.csv",stringsAsFactors = FALSE)

Now let’s clean up the column name:

stock<- stock %>% select(percentStock = X..of.Shares.Outstanding)

The empirical rule says that, for roughly normal data, about 68% of the data will be within one standard deviation of the mean, about 95% within two, and about 99.7% within three.

The function below:
1. standardizes the data
2. counts the data within z standard deviations
3. outputs the proportion

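data_within <- function(df, z) {
  # Standardize: subtract the mean and divide by the standard deviation
  func_normalize <- function(x) { (x - mean(x)) / sd(x) }
  # Remove the outlying data point (percentStock of 11 or more) before standardizing
  df <- df %>% filter(percentStock < 11)
  # Keep only the rows within z standard deviations of the mean
  df_scaled <- df %>%
    mutate(percentStock_normal = func_normalize(percentStock)) %>%
    filter(abs(percentStock_normal) < z)
  # Proportion of the remaining rows that fall within z standard deviations
  proportion <- nrow(df_scaled) / nrow(df)
  return(round(proportion, 2))
}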

Let’s collect the output in a small tibble.

tb<- tibble(
  first_std_dev = data_within(stock,1),
  second_std_dev = data_within(stock,2),
  third_std_dev = data_within(stock,3),
)

kable(tb)

first_std_dev	second_std_dev	third_std_dev
0.64	0.92	1

We can also test whether our function is working correctly:

library(testthat)

## Warning: package 'testthat' was built under R version 3.5.2

normal_generated = tibble(percentStock = rnorm(10,mean = 6.2,sd = 1.2))

#Testing our function
tb_test<- tibble(
  first_std_dev = data_within(normal_generated,1),
  second_std_dev = data_within(normal_generated,2),
  third_std_dev = data_within(normal_generated,3),
)


kable(tb_test)

first_std_dev	second_std_dev	third_std_dev
0.7	1	1

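testthat::expect_gt(tb_test$first_std_dev, 0.68, label = "data proportion within first deviation")

Hence, our function is working as expected on this run. Note that the data is randomly generated every time the code is run, so with only 10 points the proportions vary from run to run and the check can occasionally fail; a fixed seed via set.seed() or a larger sample makes it more stable.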
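For a more repeatable version of this check, here is a sketch in base R that fixes the random seed and draws a much larger sample, then compares the simulated shares with the values the normal distribution predicts. The seed, the sample size and the share_within helper are arbitrary choices of mine, and the sketch skips the percentStock < 11 filter that data_within applies.

```r
set.seed(42)  # arbitrary seed, only to make the run repeatable
x <- rnorm(100000, mean = 6.2, sd = 1.2)

# Share of values within k sample standard deviations of the sample mean.
share_within <- function(x, k) mean(abs(x - mean(x)) / sd(x) < k)

observed <- sapply(1:3, function(k) share_within(x, k))
expected <- pnorm(1:3) - pnorm(-(1:3))  # normal-theory shares for 1, 2 and 3 SDs

# With this many draws the simulated shares should sit close to the theoretical ones.
stopifnot(all(abs(observed - expected) < 0.01))
round(rbind(observed, expected), 3)
```

Comparing against a tolerance around the theoretical value, rather than requiring the share to exceed 0.68, avoids failing half the time when the simulated share lands just below the target.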