Data Visualization | 𝚃𝚛𝚊𝚗𝚜𝚙𝚘𝚗𝚜𝚝𝚎𝚛

CMU Sport Analytics Projects Slideshows

Wed, 21 Aug 2019 21:13:14 -0500

My CMSAC Experience

Jeremy Sanchez @_jsanchez1, Nathan Moss @CMU_Stats, and Kapil Khanal @Kapil71001628 working on soccer with @kpelechrinis pic.twitter.com/Ij2eFiJ8eH
— CMU Stats & DS (@CMU_Stats) July 26, 2019

Presenting our Final Project

My First Project at CMU Statistics: Sports Analytics Camp

The first week has been a good review of basic dplyr syntax and the ggplot2 philosophy. I like how the professors and TAs are always there for us. From small data manipulation problems to points being masked in scatterplots, I ran into all sorts of problems.
These are practice projects before we move on to research projects of our own choosing.

Here is the schedule of this summer camp.

Last day of #CMSACamp! Jam-packed summer full of #datascience, #sportsanalytics, speakers, tours, amazing partners @TruMediaSports @albert_larcada @stat_sam @penguins @kpelechrinis @Stat_Ron @NFL @albertbayes @bklynmaks @ATLHawks @acthomasca @sarah_malle @nflscrapR @Pirates pic.twitter.com/feG2cZnGQR
— CMU Stats & DS (@CMU_Stats) July 26, 2019

Project 1: Baseball

For this project, we looked into how similar the top 5 hitters in baseball are. Below are the slides we presented at the camp.

Similarly, for Project 2 we worked with a tennis dataset.

Project 2: Tennis

What factors are best at predicting point ratio for a match during a Grand Slam?

Project 3: Simulating an Office Environment in Analytics

This was a non-technical project, but the most fun one. Our class of 16 students was split into 4 analytics departments for a hypothetical team. There are a lot of rumors on the player market: some players who are essential to our team are up for grabs, and we also have to let some players go. The crazy part of this project is that the clock is ticking. Our boss changes her decision every few minutes in response to changes in the market. We have to come up with numbers to back up the decisions we are about to recommend.

Below is the slide we prepared within 10 minutes, with many factors changing while we were working on it.

This project shed some light on the life of working data scientists and data analysts. It’s not always about fancy graphs or complicated, tongue-twisting models. I learned that we start with the problem we have, collect the necessary data, build new metrics for the problem, graph the problems and proposed solutions so that they are intuitive to all concerned parties, and then use models to test our hypotheses and make decisions.

Project 4: Soccer

This is the final project I worked on for half of this summer camp. It is actually a work in progress. We will be changing a lot of things (I guess that is research: change until you no longer find a justification to change).

I chose this because soccer has interested me since childhood. I played soccer extensively in high school, and it still fascinates me with all the complexity involved from a mathematical, statistical, and data point of view.

Presenting to classmates before the poster presentation

Like I tweeted, I am extremely grateful to CMU Stats for letting me experience life as a data scientist.

The best 8 weeks. I got to learn so many things and enjoy Pittsburgh. The spirit at @CMU_Stats is amazing, like a Stat-Disney land. Thank you for everything especially all those free foods and tickets to game and Kennywood. #CMSACamp
— Kapil.Khanal (@almost_kapil) July 26, 2019

Data Dashboard for StockX Contest

Wed, 21 Aug 2019 21:13:14 -0500

StockX Data Contest 2019

The StockX Challenge is a call for data and sneaker nerds to have fun.

Source: StockX

The basic idea is this: they give you a bunch of original StockX sneaker data, then you crunch the numbers and come up with the coolest, smartest, most compelling story you can tell. It can be literally anything you want. A theory, an insight, even just a really original data visualization. It could be a novel hypothesis about resale prices you’ve always wanted to test. Or maybe it’s just a beautiful chart to visualize the data. It can be on any subject – sneakers, brands, buyers, or even StockX itself. Whatever you find interesting, just follow your bliss.

I also took a shot at coming up with something useful. Below is my finished data dashboard.

My Data Dashboard for StockX

Dashboard

The link to the Tableau worksheet is here

Calculations on the Dashboards

Price ratio: ratio of sale price to retail price for each sneaker.
Weeks: (Order Date - Release Date), converted to weeks.
The median price ratio is used to reduce the effect of uneven date coverage (2017 and 2019 are not as complete as 2018) and of uneven counts of sneaker sales.
The color scale for the two brands is consistent across every plot that involves brands.
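To make the two calculated fields concrete, here is a minimal R sketch of the same arithmetic. The dashboard itself was built in Tableau, so this is only an illustration; the column names (`brand`, `sale_price`, `retail_price`, `order_date`, `release_date`) and all the values are made up and may not match the actual StockX file.

```r
# Toy rows with hypothetical column names and made-up values.
sales <- data.frame(
  brand        = c("Yeezy", "Yeezy", "Off-White", "Off-White"),
  sale_price   = c(440, 330, 1200, 900),
  retail_price = c(220, 220, 200, 150),
  order_date   = as.Date(c("2018-01-15", "2018-03-01", "2018-02-01", "2018-02-22")),
  release_date = as.Date(c("2018-01-01", "2018-01-01", "2018-02-15", "2018-02-15"))
)

# Price ratio: sale price relative to retail price
sales$price_ratio <- sales$sale_price / sales$retail_price

# Weeks since release; a negative value means the order came before the release
sales$weeks <- as.numeric(difftime(sales$order_date, sales$release_date, units = "weeks"))

# Median price ratio per brand
median_ratio <- aggregate(price_ratio ~ brand, data = sales, FUN = median)
print(median_ratio)
```

The median is less sensitive than the mean to a handful of extreme resale prices, which is one reason it suits unevenly sampled years.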

1) Orders of Sneakers by Brand and Weeks from Release Date

This plot shows the total count of orders for different sneakers of the two brands. Both brands receive orders before the release date. Off-White has more orders than Yeezy in the dataset.

It’s interesting how the demand for Yeezy increased at around 90 weeks after the release of the shoes.

2) Ratio of Sale Price to Retail Price for Each Brand by Weeks

This plot looks at the ratio of sale price to retail price for each brand against weeks after the release date. Clearly, both brands' sale prices are higher than their retail prices. The ratio for Off-White increases in general regardless of the individual sneaker, while the ratio for Yeezy is somewhat noisy but follows a trend like Off-White's. Both brands' price ratios increase after the release date.

3) Distribution of Median Sale Price Given the Retail Price for Each Brand
This plot looks in detail at how the median sale price is distributed for each sneaker. It shows the distribution of median sale price for the top 28 sneakers that were sold at least 5 times over retail price.

4) Median Price and States

This plot shows the median price ratio for every state. The color scale encodes the ratio, and the size of each sneaker marker shows total sales relative to the other states. Which states usually pay more for sneakers? Clearly, Delaware, Vermont, and Utah had some sales with a high price ratio. States like California and New York have a lot of sales, as shown by their relative sizes. The relative size is calculated by taking the log of total sales in each state. States like Wyoming have fewer sales and also a lower price ratio.
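As a rough illustration of the two encodings on this map, the sketch below computes a median price ratio and a log-scaled size per state in R. The column names and numbers are invented, and treating "total sales" as summed dollars (rather than a count of orders) is an assumption on my part.

```r
# Made-up rows; `state`, `sale_price`, and `price_ratio` are assumed names.
orders <- data.frame(
  state       = c("California", "California", "California", "Wyoming"),
  sale_price  = c(500, 300, 200, 100),
  price_ratio = c(2.5, 1.5, 2.0, 1.2)
)

# Color encoding: median price ratio per state
by_state <- aggregate(price_ratio ~ state, data = orders, FUN = median)

# Size encoding: log of total sales per state (summed dollars is an assumption)
total_sales   <- tapply(orders$sale_price, orders$state, sum)
by_state$size <- log(as.numeric(total_sales[as.character(by_state$state)]))
print(by_state)
```

Taking the log keeps very large markets from dwarfing every other state on the map.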

Winona Area Public Schools: Community Contribution

Wed, 21 Aug 2019 21:13:14 -0500

Winona Area Public Schools Data Visualization

Introduction:
This project addresses the need to communicate public school data to community members in a meaningful way, and to make the data available to the general public in a proper, usable format.

There has been a wider discussion regarding the budget issue in Winona area schools. Here is the article.

Primarily, this project focused on cleaning and visualizing the Enrollment, Expenditures, and Staffing History reports of the Winona Area Public Schools (WAPS) district, available publicly through the Minnesota Department of Education Data Center: http://education.state.mn.us/MDE/Data/

Methods and Steps of Projects

1) Data Inspection/Acquisition
Public data was collected by Alison Quam (representative from the WAPS district). The data were made available in different PDF/Excel files, and the information was scattered across those files.

2) Data Cleaning and Formatting
First, most of the PDF files were converted to Excel using Tabula (http://tabula.technology/) and an online tool (http://pdftoexcel.com). Then they were cleaned into a proper format and stacked using Python (pandas).
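The stacking step was done in Python with pandas; the sketch below shows the same idea in R only to stay consistent with the other code on this page. It is not the original cleaning script: the file names, columns, and values are invented, and CSV files stand in for the converted Excel reports.

```r
# Write two small CSVs to a temp folder to stand in for the converted reports.
# File names, columns, and values are made up for this sketch.
tmp <- file.path(tempdir(), "waps_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(year = 2016, enrollment = 3000),
          file.path(tmp, "report_2016.csv"), row.names = FALSE)
write.csv(data.frame(year = 2017, enrollment = 2980),
          file.path(tmp, "report_2017.csv"), row.names = FALSE)

# Read every file and stack the rows into one table
files   <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
stacked <- do.call(rbind, lapply(files, read.csv))
print(stacked)
```

This only works when every file shares the same column names, which is why the cleaning into a common format has to come first.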

3) Data Exploration and Visualization
This part of the project focused on addressing the questions provided by the WAPS representative (Alison Quam). Tableau was used extensively to explore and visualize the data. Primarily, I focused on answering the following questions.
1. I was curious how enrollment and the capture rate (the rate of newborns who go on to enroll in kindergarten) are changing in the WAPS district.

After a few meetings with the representative, I realized she was more curious about how the schools spend across different programs.

2. How are the expenditure per average daily membership (the count of students served daily in schools) and the spending on various categories changing?
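Both questions come down to simple ratios. Here is a small R sketch with invented numbers; using births five years earlier as the denominator of the capture rate is my assumption about how such a rate would be computed, not something taken from the WAPS reports.

```r
# Made-up numbers for a single district, for illustration only.
district <- data.frame(
  year                  = c(2016, 2017),
  births_5_years_prior  = c(400, 380),    # assumed denominator for capture rate
  kindergarten_enrolled = c(300, 304),
  total_expenditure     = c(36000000, 37240000),
  adm                   = c(3000, 2980)   # average daily membership
)

# Share of the birth cohort that shows up in kindergarten
district$capture_rate <- district$kindergarten_enrolled / district$births_5_years_prior

# Spending per student served daily
district$expenditure_per_adm <- district$total_expenditure / district$adm

print(district[, c("year", "capture_rate", "expenditure_per_adm")])
```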

The link to the Tableau file and the data is here

Now, the visual story begins…

This project actually helped inform decision makers at the local level. Thus, I was able to contribute to something meaningful with my Python and Tableau skills.

Acknowledgement

I would like to thank the WAPS representative and Prof. Silas Bergen for helping and guiding me to understand the terms and calculations already done in the reports, and Prof. Todd Iverson for helping me figure out the Python code for cleaning the data.

Animation: Internet Usage

Thu, 15 Aug 2019 21:13:14 -0500

How Is the Internet Eating the World? Internet Usage Animation

Internet usage is a World Bank development indicator. In this project I used the World Bank dataset (available at the link provided below).

Link to the Tableau worksheet

Sankey diagrams for Bacteria and antibiotics

Wed, 24 Jul 2019 21:13:14 -0500

Visually Classifying Bacteria and Antibiotics

After World War II, antibiotics earned the moniker “wonder drugs” for quickly treating previously-incurable diseases. Data was gathered to determine which drug worked best for each bacterial infection. Comparing drug performance was an enormous aid for practitioners and scientists alike. In the fall of 1951, Will Burtin published a graph showing the effectiveness of three popular antibiotics on 16 different bacteria, measured in terms of minimum inhibitory concentration.

Image credit: Ask A Biologist

I am reproducing in ggplot2 this wonderful visualization by my professor, Silas Bergen, who built it in Tableau.

Let’s bring in the dataset:

library(tidyverse)
library(knitr)
library(kableExtra)
df <- read.csv("https://cdn.rawgit.com/plotly/datasets/5360f5cd/Antibiotics.csv", stringsAsFactors = FALSE)
# stringsAsFactors is a demon. Better not bring it here! We rarely need that beast.
# There are 16 bacteria, so give each one an ID to reference later.
df <- df %>% mutate(ID = row_number())

kable(head(df, n = 16))

Bacteria	Penicillin	Streptomycin	Neomycin	Gram	ID
Mycobacterium tuberculosis	800.000	5.00	2.000	negative	1
Salmonella schottmuelleri	10.000	0.80	0.090	negative	2
Proteus vulgaris	3.000	0.10	0.100	negative	3
Klebsiella pneumoniae	850.000	1.20	1.000	negative	4
Brucella abortus	1.000	2.00	0.020	negative	5
Pseudomonas aeruginosa	850.000	2.00	0.400	negative	6
Escherichia coli	100.000	0.40	0.100	negative	7
Salmonella (Eberthella) typhosa	1.000	0.40	0.008	negative	8
Aerobacter aerogenes	870.000	1.00	1.600	negative	9
Brucella antracis	0.001	0.01	0.007	positive	10
Streptococcus fecalis	1.000	1.00	0.100	positive	11
Staphylococcus aureus	0.030	0.03	0.001	positive	12
Staphylococcus albus	0.007	0.10	0.001	positive	13
Streptococcus hemolyticus	0.001	14.00	10.000	positive	14
Streptococcus viridans	0.005	10.00	40.000	positive	15
Diplococcus pneumoniae	0.005	11.00	10.000	positive	16

Before proceeding further with the data manipulation, we need to think about the format of the visualization. Here we will build the visualization at the bacteria level, which means we will have, for each bacterium, its Gram stain and the concentration of drug required.

If you look at the table above, we do have all the data we need, but not in the format we want. We want one piece of information per row (one bacterium-drug pair), unlike above, where a single row holds all the information for a bacterium. Let’s change the format of the data:

# Reshape from wide to long: one row per bacterium-drug pair
key_value <- df %>% gather("Drug", "Concentration", Penicillin:Neomycin)
kable(head(key_value))

Bacteria	Gram	ID	Drug	Concentration
Mycobacterium tuberculosis	negative	1	Penicillin	800
Salmonella schottmuelleri	negative	2	Penicillin	10
Proteus vulgaris	negative	3	Penicillin	3
Klebsiella pneumoniae	negative	4	Penicillin	850
Brucella abortus	negative	5	Penicillin	1
Pseudomonas aeruginosa	negative	6	Penicillin	850
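For readers on a recent tidyr, the same reshape can also be written with `pivot_longer()`, which the tidyr documentation recommends over `gather()`. The sketch below uses two rows copied from the first table so that it runs on its own. One difference to expect: `pivot_longer()` keeps the three drug rows of each bacterium together, while `gather()` lists all Penicillin rows first.

```r
library(tidyr)

# Two rows copied from the table above, enough to show the reshape
wide <- data.frame(
  Bacteria     = c("Mycobacterium tuberculosis", "Salmonella schottmuelleri"),
  Penicillin   = c(800, 10),
  Streptomycin = c(5, 0.8),
  Neomycin     = c(2, 0.09),
  Gram         = c("negative", "negative"),
  ID           = 1:2
)

# One row per bacterium-drug pair
long <- pivot_longer(
  wide,
  cols      = Penicillin:Neomycin,
  names_to  = "Drug",
  values_to = "Concentration"
)
print(long)
```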

Okay, so now we need to add the minimum concentration for each bacterium, basically as a new column on the table. The only thing to keep in mind is that we should group by bacterium and then select the minimum concentration. We could have computed this first, before gathering, but this follows my thought process.

df_min <- key_value %>%
  group_by(Bacteria) %>%
  summarise(Min = min(Concentration))
kable(head(df_min))

Bacteria	Min
Aerobacter aerogenes	1.000
Brucella abortus	0.020
Brucella antracis	0.001
Diplococcus pneumoniae	0.005
Escherichia coli	0.100
Klebsiella pneumoniae	1.000
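The alternative mentioned above, computing the minimum before reshaping, can be sketched in base R with `pmin()`, which takes the element-wise minimum across several vectors. This is not what the post does; it is just a way to check the grouped result. The two rows are copied from the first table.

```r
# Two rows copied from the first table
wide <- data.frame(
  Bacteria     = c("Mycobacterium tuberculosis", "Brucella antracis"),
  Penicillin   = c(800, 0.001),
  Streptomycin = c(5, 0.01),
  Neomycin     = c(2, 0.007)
)

# Row-wise minimum across the three drug columns
wide$Min <- pmin(wide$Penicillin, wide$Streptomycin, wide$Neomycin)
print(wide)
```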

So now, let’s join the df_min data frame from above with df so that the minimum is available in the main data frame.

df <- inner_join(df, df_min, by = "Bacteria")
# Label the drug whose concentration equals the minimum.
# On a tie, case_when() keeps the first matching branch.
df <- df %>% mutate(Best = case_when(
  Penicillin == Min ~ "Penicillin",
  Neomycin == Min ~ "Neomycin",
  Streptomycin == Min ~ "Streptomycin"
))

Now the data is ready and in the format we want:

kable(head(df))

Bacteria	Penicillin	Streptomycin	Neomycin	Gram	ID	Min	Best
Mycobacterium tuberculosis	800	5.0	2.00	negative	1	2.00	Neomycin
Salmonella schottmuelleri	10	0.8	0.09	negative	2	0.09	Neomycin
Proteus vulgaris	3	0.1	0.10	negative	3	0.10	Neomycin
Klebsiella pneumoniae	850	1.2	1.00	negative	4	1.00	Neomycin
Brucella abortus	1	2.0	0.02	negative	5	0.02	Neomycin
Pseudomonas aeruginosa	850	2.0	0.40	negative	6	0.40	Neomycin
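One detail worth noticing in this table: Proteus vulgaris needs 0.1 of both Streptomycin and Neomycin, yet it is labelled Neomycin. That is because `case_when()` returns the first branch that matches, and Neomycin is checked before Streptomycin. The base R sketch below reproduces that priority order with `which.min()` (which also returns the first minimum) and counts ties; the `n_tied` column is a helper I made up for illustration.

```r
# Two rows copied from the first table; Proteus vulgaris has a tie
wide <- data.frame(
  Bacteria     = c("Proteus vulgaris", "Streptococcus viridans"),
  Penicillin   = c(3, 0.005),
  Streptomycin = c(0.1, 10),
  Neomycin     = c(0.1, 40)
)

# Same priority order as the case_when() call above
drug_order <- c("Penicillin", "Neomycin", "Streptomycin")
conc       <- as.matrix(wide[, drug_order])

# which.min() returns the first minimum, so ties go to the earlier drug
wide$Best <- drug_order[apply(conc, 1, which.min)]

# Number of drugs that share the minimum concentration (hypothetical helper column)
wide$n_tied <- rowSums(conc == do.call(pmin, wide[drug_order]))
print(wide)
```

If a tie should be shown differently, for example as "Neomycin/Streptomycin", the rows with `n_tied > 1` are the ones to handle separately.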

Okay, this step might be a little unintuitive, but if we think in terms of the grammar of graphics philosophy, it will make sense.

# 100 evenly spaced T values on [-6, 6] for each of the 16 bacteria IDs
seq1 <- rep(1:16, each = 100)
seq2 <- rep(seq(-6, 6, length.out = 100), 16)
newdat <- data.frame(ID = seq1, T = seq2)
write.csv(newdat, "new_data.csv", row.names = FALSE)

We are making a new data frame that holds the data points for the sigmoid curve (you could just draw a sigmoid curve in R directly, but this way it is linked to our data by ID).
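Another way to see what this grid contains: it is every combination of the 16 IDs with the 100 T values. The sketch below builds the same grid with base R's `expand.grid()` and compares it with the `rep()` construction; the object names `grid` and `by_hand` are mine.

```r
t_grid <- seq(-6, 6, length.out = 100)

# expand.grid() varies its first argument fastest, so T cycles within each ID
grid <- expand.grid(T = t_grid, ID = 1:16)

# The rep()-based construction, rebuilt here for comparison
by_hand <- data.frame(ID = rep(1:16, each = 100), T = rep(t_grid, 16))

nrow(grid)                    # 1600 rows: 16 IDs x 100 points
all(grid$ID == by_hand$ID)    # TRUE
all(grid$T == by_hand$T)      # TRUE
```

Joining this grid to the 16-row table by ID then repeats each bacterium's row 100 times, once per point on its curve.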

# Join the data by ID
final_df <- inner_join(df, newdat, by = "ID")
kable(head(final_df))

Bacteria	Penicillin	Streptomycin	Neomycin	Gram	ID	Min	Best	T
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-6.000000
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.878788
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.757576
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.636364
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.515151
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.393939

# Evaluate the sigmoid at each T
final_df <- final_df %>% mutate(Sigmoid = 1 / (1 + exp(-T)))

Okay, so now that we have the final dataset, we can get into ggplot2 land.

p <- ggplot(data = final_df, aes(x = T, y = Sigmoid))
p + geom_point()

# Give each curve its own slope so that it ends at the row of its best antibiotic.
# Different slopes separate our curves.
final_df <- final_df %>% mutate(bestBacSlope = case_when(
  Best == "Streptomycin" ~ 4 - ID,
  Best == "Neomycin" ~ 9 - ID,
  Best == "Penicillin" ~ 14 - ID
))

final_df <- final_df %>% mutate(curveBest = ID + bestBacSlope * Sigmoid)
# Look up the ID and label for each bacterium

label_df <- final_df %>% distinct(Bacteria, ID) %>% arrange(ID)
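To see why `ID + bestBacSlope * Sigmoid` bends each line from its own row to the row of its best antibiotic, it helps to trace a single curve. The sigmoid is close to 0 at T = -6 and close to 1 at T = 6, so the curve starts near `ID` and ends near `ID + slope`, which is the target height (4, 9, or 14). A small base R check for the bacterium with ID 1, whose best drug is Neomycin (the variable names here are mine):

```r
# One curve: the bacterium with ID 1, whose best drug (Neomycin) sits at y = 9
id     <- 1
target <- 9

t_grid  <- seq(-6, 6, length.out = 100)
sigmoid <- 1 / (1 + exp(-t_grid))

# The slope is the vertical distance the curve has to travel
slope   <- target - id
curve_y <- id + slope * sigmoid

# The curve starts just above 1 and ends just below 9
range(curve_y)
```

Because the sigmoid never quite reaches 0 or 1 on a finite range, the endpoints are slightly inside the interval, which is not visible at plot scale.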

Below are the labels we will use on the y-axis.

label_y= c("Mycobacterium tuberculosis" ,  "Salmonella schottmuelleri"  ,    
           "Proteus vulgaris"        ,        "Klebsiella pneumoniae"  ,        
           "Brucella abortus"      ,          "Pseudomonas aeruginosa"    ,     
           "Escherichia coli"    ,            "Salmonella (Eberthella) typhosa",
           "Aerobacter aerogenes"     ,       "Brucella antracis"    ,          
           "Streptococcus fecalis"    ,       "Staphylococcus aureus"      ,    
           "Staphylococcus albus"    ,        "Streptococcus hemolyticus"      ,
           "Streptococcus viridans"    ,      "Diplococcus pneumoniae")

Now it’s plotting time!

# Plotting the sigmoid curves
library(ggthemes)

# color (not fill) is the aesthetic used by lines; alpha is a constant, so it goes outside aes()
sankey <- ggplot(data = final_df, aes(x = T, y = curveBest, color = Gram, size = Min, group = Bacteria)) +
  geom_line(alpha = 0.9) +
  scale_color_manual(values = c("green", "red")) +
  scale_y_continuous(breaks = 1:16, labels = label_y) +
  annotate("text", x = 6, y = 14, label = "Penicillin") +
  annotate("text", x = 6, y = 9, label = "Neomycin") +
  annotate("text", x = 6, y = 4, label = "Streptomycin") +
  annotate("text", x = 5.5, y = 15, label = "Best Antibiotics", size = 5, colour = "blue") +
  # Apply the complete theme first so the tweaks below are not overwritten
  theme_minimal() +
  theme(axis.title.y = element_blank(), axis.title.x = element_blank(), axis.line.x = element_blank(), axis.ticks.x = element_blank(), axis.text.x = element_blank())

sankey

Figure 1: Classification of Bacteria
