𝚃𝚛𝚊𝚗𝚜𝚙𝚘𝚗𝚜𝚝𝚎𝚛

Example Page 1

Sun, 05 May 2019 00:00:00 +0100

In this tutorial, I’ll share my top 10 tips for getting started with Academic:

Tip 1

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.

Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.

Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.

Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.

Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.

Tip 2

Example Page 2

Sun, 05 May 2019 00:00:00 +0100

Here are some more tips for getting started with Academic:

Tip 3

Tip 4

Example Talk

Sat, 01 Jun 2030 13:00:00 +0000

Click on the Slides button above to view the built-in slides feature.

Slides can be added in a few ways:

Create slides using Academic’s Slides feature and link using slides parameter in the front matter of the talk file
Upload an existing slide deck to static/ and link using url_slides parameter in the front matter of the talk file
Embed your slides (e.g. Google Slides) or presentation video on this page using shortcodes.

Further talk details can easily be added to this page using Markdown and $\rm \LaTeX$ math code.

Sales Impact Analysis with Clustering and Causal effects

Mon, 30 Mar 2020 00:00:00 +0000

This project looks at how introducing a discount during the holidays affects the total sales of customer groups over a one-year timeframe. The statistical techniques used are:

RFM analysis (recency, frequency, monetary) to analyse customer behavior by examining their transaction history, such as:

how recently a customer has purchased (recency)
how often they purchase (frequency)
how much the customer spends (monetary)

RFM helps us identify customers who are more likely to respond to promotions.

K-means to segment customers into various category groups.

Causal impact analysis to study the impact of discounts within each customer group
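As a rough illustration of the RFM step, here is a minimal Python sketch that builds the three numbers per customer from a transaction list. The transactions, customer ids and the `rfm_table` helper are hypothetical and are not taken from the project's data or code.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("c1", date(2019, 11, 20), 40.0),
    ("c1", date(2019, 12, 24), 60.0),
    ("c2", date(2019, 3, 2), 15.0),
    ("c3", date(2019, 6, 1), 25.0),
    ("c3", date(2019, 12, 30), 200.0),
    ("c3", date(2019, 12, 31), 50.0),
]


def rfm_table(rows, as_of):
    """Return {customer: (recency_days, frequency, monetary)}."""
    summary = {}
    for customer, day, amount in rows:
        last, count, total = summary.get(customer, (None, 0, 0.0))
        if last is None or day > last:
            last = day  # keep the most recent purchase date
        summary[customer] = (last, count + 1, total + amount)
    return {
        customer: ((as_of - last).days, count, total)
        for customer, (last, count, total) in summary.items()
    }


rfm = rfm_table(transactions, as_of=date(2020, 1, 1))
for customer, (recency, frequency, monetary) in sorted(rfm.items()):
    print(customer, recency, frequency, monetary)
```

A low recency with high frequency and monetary values (like the hypothetical `c3`) is the kind of customer RFM flags as likely to respond to a promotion. These three columns are also a natural input for the K-means segmentation step.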

Link to the GitHub project: https://github.com/KapilKhanal/Sales_Impact
Link to the data product: https://salesimpact.herokuapp.com/

Minnesota Lake Project: ML in Production Exercise

Sun, 19 Jan 2020 00:00:00 +0000

So you have a good model? Want to make it available to serve the world?

Prototype-grade model workflow to Production land workflow

Making a good model is awesome. It takes an enormous amount of experimentation and research. When we have a decent model, that is a eureka moment.

Every model wants to go out in the real world and serve its purpose.

But actually using the model in production is a whole other pain. Recently I have been learning ways to deploy models.
Source: Reactive machine learning book

Model Predictions as WebService

Now, as we can see, it is a lifecycle. There are a lot of nuances in deploying models. The workflow has to be reproducible, elastic and easy to manage. If you end up changing the model, the infrastructure should not have to change. For example, I used a simple regression model for this project; if I now train a random forest model, the parts that need to change should be easy to change without touching the infrastructure. That is, I collect all the parameters, file locations and data locations in one file, say a config file. Similarly, if I separate feature engineering, feature selection, data validation, etc. into their own files, then it will be easy to deploy [[Modular code]].
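One way to picture the config-file idea is the sketch below: the model choice and file locations live in a single dictionary, and the training code looks the model up by name, so swapping models means editing the config only. All names here (`CONFIG`, `MODEL_REGISTRY`, the toy trainers and the paths) are hypothetical placeholders, not the project's actual code.

```python
# Everything that changes between experiments lives in one place.
CONFIG = {
    "model_name": "mean_baseline",      # hypothetical model key
    "data_path": "data/lakes.csv",      # hypothetical location
    "model_path": "models/model.pkl",   # hypothetical location
    "params": {},
}


def train_mean_baseline(targets, params):
    """Toy stand-in for a real model: always predicts the training mean."""
    mean = sum(targets) / len(targets)
    return lambda features: mean


def train_last_value(targets, params):
    """Second toy model: always predicts the last observed value."""
    last = targets[-1]
    return lambda features: last


# Adding a model means adding one entry here; the rest stays untouched.
MODEL_REGISTRY = {
    "mean_baseline": train_mean_baseline,
    "last_value": train_last_value,
}


def train(config, targets):
    trainer = MODEL_REGISTRY[config["model_name"]]
    return trainer(targets, config["params"])


model = train(CONFIG, [2.0, 4.0, 6.0])
print(model(None))
```

Changing `"model_name"` to `"last_value"` switches the model without touching `train` or whatever code serves the predictions.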

I can always train two different models and put them in a Python package or a cloud location such as PyPI or S3; then I can easily retrieve those models and use them in the Flask API I design just to serve the model.

Think of each service as a different code repository. We will have three different repos.

Python package for retrieving data, training the model and uploading the final model to PyPI or S3

ML API: Flask application to serve the model by downloading it from PyPI/S3 and exposing an API that takes the data and returns the prediction

Another Flask/front-end framework app to get the JSON from the API and populate the dashboard with plots and predictions

While learning about this, I came across the ideas of DataOps and MLOps. I think in the future most software will be ML software doing real-time prediction and inference with very little slowdown. Wait, don’t humans do that?

Basic software engineering skills and gotchas:

What packages does your application need? requirements.txt

Use the right data types and schema for your data/database. Making every column StringType() is not a wise move in a database.
Do you really need all those dataframes in memory? Remove or avoid storing intermediate dataframes. If possible, do all the column operations in the database itself.
Always use Git and version your code (and your data, with Data Version Control) in the right way. The simplest way is to store all the data used in prediction in a database with the model version and predicted value.
Use venv (virtual environments). You don’t want conflicting libraries quarrelling with each other.
Set a random seed while training (set.seed() in R, random.seed() in Python) to increase reproducibility. Use the same seed across different models.
Think of logging: what do you want to monitor? import logging in Python.
Think of how you are going to share your final model (pickle file? parametrized formula?)
A Docker image, ready to be used, is a good choice.
Separate the data collection tool from the ML pipeline. Different repository for data wrangling, ML training and dashboard/front-end
Your tools should have clear input parameters e.g., Path to the repository
The command line tool should not work if input parameters are wrong
Make config parameters very clear
A config.py file where people can tune specific configs
If you use environment variables, document them clearly
Do not use hard coded paths
During development, consider storing intermediate steps (R Markdown or Jupyter notebooks)
Understand the importance of the data you are passing to your model
Pay attention to “garbage-in garbage-out”
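Several of the tips above (clear input parameters, failing on wrong input, documented environment variables, seeding and logging) fit in a few lines of standard-library Python. This is only a sketch: the `--repo` and `--seed` flags, the `PIPELINE_SEED` variable and the logger name are hypothetical, and `random.random()` stands in for a real training run.

```python
import argparse
import logging
import os
import random

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")  # hypothetical logger name


def existing_dir(value):
    """argparse type: reject anything that is not an existing directory."""
    if not os.path.isdir(value):
        raise argparse.ArgumentTypeError(f"not an existing directory: {value}")
    return value


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Training step (sketch)")
    parser.add_argument("--repo", required=True, type=existing_dir,
                        help="path to the repository")
    # PIPELINE_SEED is a hypothetical, documented environment variable.
    parser.add_argument("--seed", type=int,
                        default=int(os.environ.get("PIPELINE_SEED", "42")),
                        help="random seed (default: PIPELINE_SEED, else 42)")
    return parser.parse_args(argv)


def main(argv):
    args = parse_args(argv)
    random.seed(args.seed)  # same seed, same run
    log.info("training with repo=%s seed=%d", args.repo, args.seed)
    return random.random()  # stand-in for a real training run


first = main(["--repo", ".", "--seed", "1"])
second = main(["--repo", ".", "--seed", "1"])
print(first == second)
```

With a wrong path, argparse prints a usage message and exits with a non-zero status, which matches the tip that the command line tool should refuse to run on bad input.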

If you would prefer a more in-depth resource on software skills, I found these summary notes of the famous book Clean Code.

All these different services are extensive on their own. Without a dedicated team, these services will not succeed. But this exercise was meant to give a general understanding of the overall ecosystem of data and ML systems. To be a good data scientist, I think it is good to get a lay of the land.

If you would like to see the end product, this link will take you there: Lake Dashboard

Some tips are blatantly copied from the reference below.
References: http://gousios.org/courses/ml4se/building-your-ml-pipeline.html

Data Driven Public Policy

Fri, 27 Dec 2019 00:00:00 +0000

Most enacted policies have their own reasons. Policymakers and policy analysts have debated policies for years. These policies impact people’s lives, but very few people have had a direct say in how a policy was proposed, analyzed and enacted. Democracy thrives on public opinion. Ironically, nowadays the very foundation of democracy is shaken because of public opinion. And it turns out that factual understanding of the world is also getting blurrier day by day. When people do not own the research funding or are not in the policy debating loop, they disagree with policies even though those policies benefit them.

A lot of policies, especially science and public good policies, should be data driven. National representatives and other policymakers should not just make the data public but also have a wider discussion before even presenting a policy in the public sphere.

Conspiracies and falsified information can only be rooted out if we let people know how much money, time and effort was behind a policy. Average people do not know the level of investment when they are sharing, retweeting or making memes about falsified information. People should be made aware that scientific research is not just a Google search. A Google search is not research; one is merely fishing for information. For example, only very few scholars know how much investment it took to eradicate measles through immunization policy, but now it takes only a share or retweet to bring it back, because people don’t see how many professionals spent their time, how much money governments invested, and how many schools and universities debated, cross-checked and verified the policy before it reached the public sphere. I think this is exactly where the general public should be included. This is only possible through sharing public data: a repository of metadata for every single public policy that got enacted.

Minnesota Lake Project

Sun, 27 Oct 2019 00:00:00 +0000

This dataset contains lake quality measurements for each lake and year.

MCES data: the MCES Citizen-Assisted Monitoring Program (CAMP)

The goal of the MCES lake monitoring program is to obtain and provide information that enables cities, counties, lake associations, and watershed management districts to better manage TCMA lakes, thereby protecting and improving lake water quality.

CMU Sport Analytics Projects Slideshows

Wed, 21 Aug 2019 21:13:14 -0500

My CMSAC Experience

Jeremy Sanchez @_jsanchez1, Nathan Moss @CMU_Stats, and Kapil Khanal @Kapil71001628 working on soccer with @kpelechrinis pic.twitter.com/Ij2eFiJ8eH
— CMU Stats & DS (@CMU_Stats) July 26, 2019

Presenting our Final Project

My First Project at CMU Statistics: Sport Analytics Camp

The first week has been a good review of basic dplyr syntax and ggplot2 philosophy. I like how the professors and TAs are always there for us. From small data manipulation problems to points being masked in scatterplots, I ran into all sorts of problems.
These are practice projects before we actually work on the research projects of our choice.

Here is the schedule of this summer camp.

Last day of #CMSACamp! Jam-packed summer full of #datascience, #sportsanalytics, speakers, tours, amazing partners @TruMediaSports @albert_larcada @stat_sam @penguins @kpelechrinis @Stat_Ron @NFL @albertbayes @bklynmaks @ATLHawks @acthomasca @sarah_malle @nflscrapR @Pirates pic.twitter.com/feG2cZnGQR
— CMU Stats & DS (@CMU_Stats) July 26, 2019

Project 1: Baseball

For this project, we looked into how similar the top 5 hitters in baseball are. Below is the slide we presented at the camp.

Similarly, for project 2 we did another project using a tennis dataset.

Project 2: Tennis

What factors are best at predicting point ratio for a match during a Grand Slam?

Project 3: Simulating Office Environment in Analytics

This is a non-technical project, but the most fun one. Our class of 16 students was partitioned into 4 analytics departments for a hypothetical team. There are a lot of rumours on the players market, where some players who are extremely essential for our team are up for grabs. Also, we have to let go of some players. The crazy part of this project is that time is ticking. Our boss changes her decision every few minutes as the market changes. We have to come up with some numbers to back up the decisions we are about to recommend.

Below is the slide we prepared within 10 minutes, with so many factors changing while we were working on it.

This project shed some light on the life of working data scientists and data analysts. It’s not always about fancy graphs or complicated, tongue-twisting models. I learned that we start with the problem we have, collect the necessary data, make new metrics as the problem requires, graph problems and proposed solutions so that they are intuitive to all concerned parties, and then use models to test our hypotheses and make decisions.

Project 4: Soccer

This is the final project I worked on for half of this summer camp. This is actually a work in progress. We will be changing a lot of things (I guess that is research: change until you no longer find a justification to change).

I chose this because soccer has interested me since childhood. I played soccer extensively in high school, and it still fascinates me with all the complexity involved from a mathematical, statistical and data point of view.

Presenting to classmates before the poster presentation

Like I tweeted, I am extremely grateful to CMU Stats for letting me experience life as a data scientist.

The best 8 weeks. I got to learn so many things and enjoy Pittsburgh. The spirit at @CMU_Stats is amazing, like a Stat-Disney land. Thank you for everything especially all those free foods and tickets to game and Kennywood. #CMSACamp
— Kapil.Khanal (@almost_kapil) July 26, 2019

Data Dashboard for StockX Contest

Wed, 21 Aug 2019 21:13:14 -0500

StockX Data Contest 2019

The StockX Challenge is a call for data and sneaker nerds to have fun.

source: stockX

The basic idea is this: they give you a bunch of original StockX sneaker data, then you crunch the numbers and come up with the coolest, smartest, most compelling story you can tell. It can be literally anything you want. A theory, an insight, even just a really original data visualization. It could be a novel hypothesis about resale prices you’ve always wanted to test. Or maybe it’s just a beautiful chart to visualize the data. It can be on any subject – sneakers, brands, buyers, or even StockX itself. Whatever you find interesting, just follow your bliss.

I also took a shot at coming up with something useful. Below is my finished data dashboard.

My Data Dashboard for StockX

Dashboard

The link for tableau worksheet is here

Calculations on the Dashboards

Price ratio: ratio of sale price to retail price for each sneaker.
Weeks: (order date - release date), converted to weeks.
The median price ratio is used to reduce the effect of the asymmetrical date range (2017 and 2019 are not as complete as 2018) and of the differing counts of sneaker sales.
The color scale for the two brands is consistent wherever a plot relates to brands.
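The calculations listed above can be written out as a small Python sketch. The order rows below are hypothetical and only illustrate the arithmetic; the actual dashboard computes these fields in Tableau.

```python
from datetime import date
from statistics import median

# Hypothetical rows: (sneaker, order_date, release_date, sale_price, retail_price)
orders = [
    ("sneaker-a", date(2018, 1, 15), date(2018, 1, 1), 440.0, 220.0),
    ("sneaker-a", date(2018, 3, 12), date(2018, 1, 1), 660.0, 220.0),
    ("sneaker-a", date(2017, 12, 25), date(2018, 1, 1), 330.0, 220.0),
]


def price_ratio(sale_price, retail_price):
    """Sale price relative to retail price; above 1 means sold over retail."""
    return sale_price / retail_price


def weeks_since_release(order_date, release_date):
    """Negative values mean the order happened before the release date."""
    return (order_date - release_date).days / 7


ratios = [price_ratio(sale, retail) for _, _, _, sale, retail in orders]
weeks = [weeks_since_release(ordered, released) for _, ordered, released, _, _ in orders]
median_ratio = median(ratios)

print(ratios, weeks, median_ratio)
```

The third hypothetical row shows why weeks can be negative: an order placed before the release date ends up to the left of zero on the weeks axis.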

1) Order of Sneakers by Brand for Weeks from Release Date

This plot shows the total count of orders for different sneakers of the two brands. Both brands are ordered before the release date. Off-White has more orders than Yeezy in the dataset.

It’s interesting how the demand for Yeezy increased at around `90 weeks` after the release of the shoes.

2) Ratio of Sale Price to Retail Price for Each Brand by Weeks

This plot looks at the relation between the ratio of sale price to retail price for each brand and the weeks after the release date. Clearly, both brands’ sale prices are higher than the retail price. The ratio for Off-White increases in general regardless of the individual sneaker, while the ratio for Yeezy is somewhat noisy but follows a trend like Off-White’s. Both brands’ price ratios increase after the release date.

3) Distribution of Median Sale Price Given the Retail Price for Each Brand

This plot looks in detail at how the median sale price is distributed for each sneaker. The distribution of the median sale price is plotted for the top 28 sneakers that were sold at least 5 times over retail price.

4) Median Price and States

This plot looks at the median price ratio for all the states. The color scale encodes the ratio, and the size of each sneaker mark shows total sales relative to others. Which states usually pay more for sneakers? Clearly, Delaware, Vermont and Utah had some sales with a high price ratio. States like California and New York have a lot of sales, as shown by their relative sizes. The relative size is calculated by taking the log of total sales in each state. States like Wyoming have fewer sales and also a lower price ratio.

Winona Area Public Schools: Community Contribution

Wed, 21 Aug 2019 21:13:14 -0500

Winona Area Public Schools Data Visualization

Introduction:
This project addresses the need to communicate public school data to community members in a meaningful way, and to make the data available to the general public in a proper and usable format.

There has been a wider discussion regarding the budget issue in Winona area schools. Here is the article

Primarily, this project focused on cleaning and visualizing the Enrollment, Expenditures and Staffing History reports of the Winona Area Public Schools district (WAPS), available publicly through the Minnesota Department of Education Data Center. Link: http://education.state.mn.us/MDE/Data/

Methods and Steps of Projects

1) Data Inspection/Acquisition
Public data was collected by Alison Quam (representative from the WAPS district). The data were made available in different PDF/Excel files, and the information was scattered across them.

2) Data Cleaning and Formatting
First, most of the PDF files were converted to Excel with Tabula (link: http://tabula.technology/) and an online tool (http://pdftoexcel.com); then they were cleaned up into a proper format and stacked using Python (Pandas).

3) Data Exploration and Visualization
This part of the project focused on addressing the questions provided by the WAPS representative (Alison Quam). Tableau was used extensively to explore and visualize the data. Primarily, I focused on answering the following questions.
1. I was curious about how enrollment and the capture rate (the rate of newborns enrolling in kindergarten) are changing in the WAPS district.

After a few meetings with the representative, I realized she was more curious about how schools spend across different programs.

2. How are the expenditure per average daily membership (count of students served daily in schools) and spending on various categories changing?
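The two measures behind these questions are simple ratios, sketched below in Python. The figures are made up for illustration, and dividing kindergarten enrollment by births from an earlier cohort is an assumption about how a capture rate could be computed, not the definition used in the WAPS reports.

```python
def expenditure_per_adm(total_expenditure, average_daily_membership):
    """Spending per student served daily (ADM = average daily membership)."""
    return total_expenditure / average_daily_membership


def capture_rate(kindergarten_enrollment, births_in_cohort):
    """Share of a birth cohort that later enrolls in kindergarten (assumed definition)."""
    return kindergarten_enrollment / births_in_cohort


# Hypothetical figures: year -> (total expenditure, average daily membership)
budget = {
    2016: (30_000_000.0, 3000.0),
    2017: (31_000_000.0, 2900.0),
}

per_adm = {year: expenditure_per_adm(spend, adm) for year, (spend, adm) in budget.items()}
change = per_adm[2017] - per_adm[2016]

print(per_adm[2016], round(per_adm[2017], 2), round(change, 2))
print(capture_rate(180, 240))
```

In this made-up example spending per student rises even though total spending barely moves, because membership falls. That is the kind of pattern the per-ADM view is meant to expose.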

The link to the Tableau file and the data is here

Now, the visual story begins…

This project actually helped inform decision makers at the local level. Thus, I was able to contribute to something meaningful with my Python and Tableau skills.

Acknowledgement

I would like to thank the WAPS representative and Prof. Silas Bergen for helping and guiding me to understand the terms and calculations already done in the reports, and Prof. Todd Iverson for helping me figure out the Python code for cleaning the data.

Animation: Internet Usage

Thu, 15 Aug 2019 21:13:14 -0500

How is the internet eating the world? Internet usage animation

Internet usage is a World Bank development indicator. In this project I grabbed the World Bank dataset (which is in the link provided below).

Link to the tableau worksheet

Sankey diagrams for Bacteria and antibiotics

Wed, 24 Jul 2019 21:13:14 -0500

Visually Classifying Bacteria and Antibiotics

After World War II, antibiotics earned the moniker “wonder drugs” for quickly treating previously-incurable diseases. Data was gathered to determine which drug worked best for each bacterial infection. Comparing drug performance was an enormous aid for practitioners and scientists alike. In the fall of 1951, Will Burtin published a graph showing the effectiveness of three popular antibiotics on 16 different bacteria, measured in terms of minimum inhibitory concentration.

Image credit: Ask a Biologist

I am reproducing in ggplot2 this wonderful visualization from my professor (Silas Bergen), who did it in Tableau.

Let’s bring in the dataset:

library(tidyverse)
library(knitr)
library(kableExtra)
df <- read.csv("https://cdn.rawgit.com/plotly/datasets/5360f5cd/Antibiotics.csv", stringsAsFactors = FALSE)
# stringsAsFactors is a demon, better not bring it here! We rarely need that beast.
# There are 16 bacteria, so give each one an ID to reference later.
df <- df %>% mutate(ID = row_number())

kable(head(df, n = 16))

Bacteria	Penicillin	Streptomycin	Neomycin	Gram	ID
Mycobacterium tuberculosis	800.000	5.00	2.000	negative	1
Salmonella schottmuelleri	10.000	0.80	0.090	negative	2
Proteus vulgaris	3.000	0.10	0.100	negative	3
Klebsiella pneumoniae	850.000	1.20	1.000	negative	4
Brucella abortus	1.000	2.00	0.020	negative	5
Pseudomonas aeruginosa	850.000	2.00	0.400	negative	6
Escherichia coli	100.000	0.40	0.100	negative	7
Salmonella (Eberthella) typhosa	1.000	0.40	0.008	negative	8
Aerobacter aerogenes	870.000	1.00	1.600	negative	9
Brucella antracis	0.001	0.01	0.007	positive	10
Streptococcus fecalis	1.000	1.00	0.100	positive	11
Staphylococcus aureus	0.030	0.03	0.001	positive	12
Staphylococcus albus	0.007	0.10	0.001	positive	13
Streptococcus hemolyticus	0.001	14.00	10.000	positive	14
Streptococcus viridans	0.005	10.00	40.000	positive	15
Diplococcus pneumoniae	0.005	11.00	10.000	positive	16

Before proceeding further with the data manipulation, we need to think about the format of the visualization. Here we will be making our visualization at the bacteria level, which means we will have information for each bacterium: its Gram stain and the concentration of drug required.

If you look at the table above, we do have all the data we need, but not in the format we want. We want one piece of information per row for each bacterium, unlike above where a single row holds all the information for a bacterium. Let’s change the format of the data:

# Wide to long: one row per (bacterium, drug) pair
key_value <- df %>% gather("Drug", "Concentration", Penicillin:Neomycin)
kable(head(key_value))

Bacteria	Gram	ID	Drug	Concentration
Mycobacterium tuberculosis	negative	1	Penicillin	800
Salmonella schottmuelleri	negative	2	Penicillin	10
Proteus vulgaris	negative	3	Penicillin	3
Klebsiella pneumoniae	negative	4	Penicillin	850
Brucella abortus	negative	5	Penicillin	1
Pseudomonas aeruginosa	negative	6	Penicillin	850

Okay, so now what we need to do is add the minimum concentration for each bacterium, basically a new column on top of the gathered table above. The only thing to note is that we should group the rows by bacterium and select the minimum concentration. We could have done this first (basically for each bacterium) and then gathered like above, but this is my thought process.

df_min<- key_value  %>% 
  group_by(Bacteria) %>% summarise(Min = min(Concentration))
kable(head(df_min))

Bacteria	Min
Aerobacter aerogenes	1.000
Brucella abortus	0.020
Brucella antracis	0.001
Diplococcus pneumoniae	0.005
Escherichia coli	0.100
Klebsiella pneumoniae	1.000

So now, let’s join this df_min dataframe from above with df to have that minimum information in the dataframe.

df <- inner_join(df, df_min, by = "Bacteria")
# case_when() returns the first matching branch, so ties between drugs
# resolve in the order listed here
df <- df %>% mutate(Best = case_when(
  Penicillin == Min ~ "Penicillin",
  Neomycin == Min ~ "Neomycin",
  Streptomycin == Min ~ "Streptomycin"
))

Now the data is ready and in the format we want:

kable(head(df))

Bacteria	Penicillin	Streptomycin	Neomycin	Gram	ID	Min	Best
Mycobacterium tuberculosis	800	5.0	2.00	negative	1	2.00	Neomycin
Salmonella schottmuelleri	10	0.8	0.09	negative	2	0.09	Neomycin
Proteus vulgaris	3	0.1	0.10	negative	3	0.10	Neomycin
Klebsiella pneumoniae	850	1.2	1.00	negative	4	1.00	Neomycin
Brucella abortus	1	2.0	0.02	negative	5	0.02	Neomycin
Pseudomonas aeruginosa	850	2.0	0.40	negative	6	0.40	Neomycin
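For readers who think in Python, the same "find the minimum, then label the best drug" step can be sketched with plain dictionaries. The values are the first three rows of the table printed earlier; the `best_drug` helper and the `PRIORITY` list are hypothetical names that mirror the order of the case_when branches.

```python
# Minimum inhibitory concentrations for the first three bacteria in the table.
mic = {
    "Mycobacterium tuberculosis": {"Penicillin": 800.0, "Streptomycin": 5.0, "Neomycin": 2.0},
    "Salmonella schottmuelleri": {"Penicillin": 10.0, "Streptomycin": 0.8, "Neomycin": 0.09},
    "Proteus vulgaris": {"Penicillin": 3.0, "Streptomycin": 0.1, "Neomycin": 0.1},
}

# Same order as the case_when branches: the first match wins on ties.
PRIORITY = ["Penicillin", "Neomycin", "Streptomycin"]


def best_drug(concentrations):
    """Return (drug, minimum concentration) for one bacterium."""
    lowest = min(concentrations.values())
    for drug in PRIORITY:
        if concentrations[drug] == lowest:
            return drug, lowest


best = {bacteria: best_drug(values) for bacteria, values in mic.items()}
for bacteria, (drug, lowest) in best.items():
    print(bacteria, drug, lowest)
```

Proteus vulgaris is the interesting row: Streptomycin and Neomycin tie at 0.1, and the branch order is what makes Neomycin the label, as in the table above.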

Okay, this step might be a little unintuitive, but if we think with the grammar of graphics philosophy it will make sense.

seq1 <- rep(1:16,each=100)
seq2 <-rep(seq(-6,6,length=100),16)
newdat <-data.frame(ID=seq1,T=seq2)
write.csv(newdat,"new_data.csv",row.names=FALSE)

We are making a new dataframe that holds the data points for the sigmoid curve (you could just draw a sigmoid curve in R, but this way it is linked to our data through ID).

#Joining the data by ID
final_df<-inner_join(df,newdat,by = "ID")
kable(head(final_df))

Bacteria	Penicillin	Streptomycin	Neomycin	Gram	ID	Min	Best	T
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-6.000000
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.878788
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.757576
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.636364
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.515151
Mycobacterium tuberculosis	800	5	2	negative	1	2	Neomycin	-5.393939

#ggplot
final_df <- final_df %>% mutate(Sigmoid = 1/(1 + exp(-T)))

Okay, so now that we have the final dataset, we can enter ggplot2 land.

p <- ggplot(data = final_df , aes(x = T , y = Sigmoid ))
p + geom_point()

# Making the best slope
# Different slopes will separate our curves
final_df <- final_df %>% mutate(bestBacSlope = case_when(
  Best == "Streptomycin" ~ 4 - ID,
  Best == "Neomycin" ~ 9 - ID,
  Best == "Penicillin" ~ 14 - ID
))

final_df <- final_df %>% mutate(curveBest = ID + bestBacSlope * Sigmoid)
# Figuring out ID and labels: one row per bacterium, ordered by ID

label_df <- final_df %>% distinct(Bacteria, ID) %>% arrange(ID)
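To see why the slope trick works, here is the curveBest formula on its own as a small Python sketch. Each curve starts near the bacterium's ID on the left and ends near the y position of its best antibiotic on the right (4, 9 or 14, as in the case_when above). The function names are hypothetical.

```python
import math

# y positions of the three antibiotics, taken from the case_when above.
TARGET_Y = {"Streptomycin": 4, "Neomycin": 9, "Penicillin": 14}


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def curve_y(bacteria_id, best, t):
    """ID + (target - ID) * sigmoid(t): slides from ID to the target as t grows."""
    slope = TARGET_Y[best] - bacteria_id
    return bacteria_id + slope * sigmoid(t)


# Bacterium 1 (best drug: Neomycin) traced from the left edge to the right edge.
for t in (-6.0, 0.0, 6.0):
    print(t, round(curve_y(1, "Neomycin", t), 3))
```

At T = -6 the sigmoid is close to 0, so the curve sits at roughly the ID; at T = 6 it is close to 1, so the curve has reached the antibiotic's position. Halfway (T = 0) it is exactly in between.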

Below are the labels we will use on the y-axis

label_y= c("Mycobacterium tuberculosis" ,  "Salmonella schottmuelleri"  ,    
           "Proteus vulgaris"        ,        "Klebsiella pneumoniae"  ,        
           "Brucella abortus"      ,          "Pseudomonas aeruginosa"    ,     
           "Escherichia coli"    ,            "Salmonella (Eberthella) typhosa",
           "Aerobacter aerogenes"     ,       "Brucella antracis"    ,          
           "Streptococcus fecalis"    ,       "Staphylococcus aureus"      ,    
           "Staphylococcus albus"    ,        "Streptococcus hemolyticus"      ,
           "Streptococcus viridans"    ,      "Diplococcus pneumoniae")

Now it’s plotting time!

#Plotting the sigmoid plots
library(ggthemes)

## Warning: package 'ggthemes' was built under R version 3.5.2

sankey <- ggplot(data = final_df, aes(x = T, y = curveBest, color = Gram, size = Min, group = Bacteria)) +
  geom_line(alpha = 0.9) +  # alpha is a constant, so it is set outside aes()
  # Gram is mapped to color (not fill), so the manual scale must be a color scale
  scale_color_manual(values = c("green", "red")) +
  scale_y_continuous(breaks = 1:16, labels = label_y) +
  # Apply the complete theme first so the tweaks below are not overwritten
  theme_minimal() +
  theme(axis.title.y = element_blank(), axis.line.x = element_blank(), axis.ticks.x = element_blank(), axis.title.x = element_blank(), axis.text.x.bottom = element_blank()) +
  annotate("text", x = 6, y = 14, label = "Penicillin") +
  annotate("text", x = 6, y = 9, label = "Neomycin") +
  annotate("text", x = 6, y = 4, label = "Streptomycin") +
  annotate("text", x = 5.5, y = 15, label = "Best Antibiotics", size = 5, colour = "blue")

sankey

Figure 1: Classification of Bacteria

Truck Factor

Tue, 23 Jul 2019 21:13:14 -0500

Truck Factor

Today I learned an interesting concept in software engineering and project management called “Truck Factor”. It is the minimum number of contributors to a project who need to be hit by a truck before the project is crippled and left unfinished.

My first thought was: why would you think of such an extreme example? It seems there is an emphasis on how important some people are to the project. This suggests a need for many heroes rather than a single hero. It’s a good metric to see how centralized your project is in terms of contributions. You would want many people on the project, helping each other, so that if one gets hit by a bus the project is not in serious trouble.

There is a whole discipline of organizing projects called Agile methodology, if anyone is interested.
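The definition above does not say how to measure when a project is "crippled", so the sketch below makes an assumption: a file is orphaned once none of its authors remain, and the project is crippled once more than half of its files are orphaned. It then greedily removes the busiest contributors. The file-to-author maps and the 0.5 threshold are hypothetical.

```python
def truck_factor(file_authors, threshold=0.5):
    """Greedy estimate: contributors to remove before too many files are orphaned."""
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    removed = []
    while True:
        orphaned = sum(1 for authors in remaining.values() if not authors)
        if not remaining or orphaned / len(remaining) > threshold:
            return removed
        counts = {}
        for authors in remaining.values():
            for author in authors:
                counts[author] = counts.get(author, 0) + 1
        if not counts:
            return removed
        # Remove the contributor who covers the most files (ties: alphabetical).
        busiest = max(sorted(counts), key=counts.get)
        removed.append(busiest)
        for authors in remaining.values():
            authors.discard(busiest)


# Hypothetical projects: file -> people who know that file.
single_hero = {"a.py": {"ann"}, "b.py": {"ann"}, "c.py": {"ann"},
               "d.py": {"ann", "bob"}, "e.py": {"bob"}}
shared = {"a.py": {"ann", "bob"}, "b.py": {"ann", "bob"}, "c.py": {"bob", "cy"}}

print(len(truck_factor(single_hero)), len(truck_factor(shared)))
```

In the first hypothetical project, losing one person orphans most of the code; in the second, knowledge is shared and it takes two. A higher number means contributions are less centralized.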

An example preprint / working paper

Sun, 07 Apr 2019 00:00:00 +0000

Click the Slides button above to demo Academic’s Markdown slides feature.

Supplementary notes can be added here, including code and math.

Slides

Tue, 05 Feb 2019 00:00:00 +0000

Welcome to Slides

Academic

Features

Efficiently write slides in Markdown
3-in-1: Create, Present, and Publish your slides
Supports speaker notes
Mobile friendly slides

Controls

Next: Right Arrow or Space
Previous: Left Arrow
Start: Home
Finish: End
Overview: Esc
Speaker notes: S
Fullscreen: F
Zoom: Alt + Click
PDF Export: E

Code Highlighting

Inline code: variable

Code block:

porridge = "blueberry"
if porridge == "blueberry":
    print("Eating...")

Math

In-line math: $x + y = z$

Block math:

$$ f\left( x \right) = \;\frac{{2\left( {x + 4} \right)\left( {x - 4} \right)}}{{\left( {x + 4} \right)\left( {x + 1} \right)}} $$

Fragments

Make content appear incrementally

{{% fragment %}} One {{% /fragment %}}
{{% fragment %}} **Two** {{% /fragment %}}
{{% fragment %}} Three {{% /fragment %}}

Press Space to play!

One Two Three

A fragment can accept two optional parameters:

class: use a custom style (requires definition in custom CSS)
weight: sets the order in which a fragment appears

Speaker Notes

Add speaker notes to your presentation

{{% speaker_note %}}
- Only the speaker can read these notes
- Press `S` key to view
{{% /speaker_note %}}

Press the S key to view the speaker notes!

Themes

black: Black background, white text, blue links (default)
white: White background, black text, blue links
league: Gray background, white text, blue links
beige: Beige background, dark text, brown links
sky: Blue background, thin dark text, blue links

night: Black background, thick white text, orange links
serif: Cappuccino background, gray text, brown links
simple: White background, black text, blue links
solarized: Cream-colored background, dark green text, blue links

Custom Slide

Customize the slide style and background

{{< slide background-image="/img/boards.jpg" >}}
{{< slide background-color="#0000FF" >}}
{{< slide class="my-style" >}}

Custom CSS Example

Let’s make headers navy colored.

Create assets/css/reveal_custom.css with:

.reveal section h1,
.reveal section h2,
.reveal section h3 {
  color: navy;
}

Questions?

Ask

Documentation

Watershed Quality in Minnesota

Sat, 23 Jun 2018 21:13:14 -0500

Data Product Below…

Link to the competition the data comes from: http://minneanalytics.org/minnemudac-2016/data/
Our submission as freshmen: MINNEMUDAC:2016
Water Quality Analytics Competition:
A blog on a data analytics competition that we recently participated in. We were ranked 5th out of 19 teams from regional universities of the Midwest USA. This is the analysis report of an analytics competition that I participated in on Nov 4 and Nov 5 at Optum Technologies in Eden Prairie, Minnesota. The competition was organized by MinneAnalytics [the biggest analytics group in Minneapolis], MUDAC [the yearly analytics event of Winona State University] and Social Data Science [a data science for social good platform based in Minneapolis].

Thanks to my wonderful team for the collaboration and to our professor for helping make this happen!
For the interactive dashboard of our report: https://public.tableau.com/profile/malek.hakim#!/vizhome/PARCELS_Story/WaterQualityVisualizationsintheTwinCities

My first data analytics competition, and we got an honourable mention

This project is divided into two parts.

Data Cleaning and Data Management

Data Product and Presentation

The first part of this project was done in Python. The code is here: https://github.com/KapilKhanal/DSCI430/blob/master/project_data_khanal.py

The data was sponsored by MinneMUDAC as part of the Fall Data Challenge.

The second part of the project focused on making a usable data product. The code is here: https://github.com/KapilKhanal/DSCI430/blob/master/app.R

Below you can use this product

Testing Alcohol level

Wed, 23 May 2018 21:13:14 -0500

Is there really 5.4% alcohol in that beer brand?

We all see that a lot of brands print on their label that the alcohol level is 5.4%. Let’s say we collected the alcohol by volume for those brands: we sampled beers at random and measured the alcohol level ourselves.

So we believe that the actual alcohol percentage should be 5.4%, but as beer consumers we sometimes feel it’s not.

If we measured one beer and found it had 6.7%, we would immediately complain that the brand is lying to us about the 5.4%. They may argue that our measuring apparatus or technique is not 100% accurate. There is no way of finding out how inaccurate our measurement is without measuring multiple times or measuring multiple beers. It might be that our measurement is 100% accurate and the beer has more alcohol than the company says. We don’t really know. Also, we can’t measure every single beer they ever manufactured. This is the perfect time to put our statistical sense to the test. Below we have a list of measurements from different beers bought at random, some from midtown, some from Walmart. Let’s do a t-test.

level = c(5.1,5.2,6,7,5.01,5.0,6.5,5.6,5.2,6.1,6.2,5.0)
t.test(level, mu = 5.4)

## 
##  One Sample t-test
## 
## data:  level
## t = 1.3139, df = 11, p-value = 0.2156
## alternative hypothesis: true mean is not equal to 5.4
## 95 percent confidence interval:
##  5.225010 6.093323
## sample estimates:
## mean of x 
##  5.659167

The p-value (0.2156) is greater than 0.05, and the 95% confidence interval [5.23, 6.09] contains the claimed mean of 5.4, so we cannot reject the brand’s claim. Roughly speaking, if 100 people repeated this random sampling of beers and each calculated a confidence interval, about 95 of those intervals would contain the true mean.
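To see where the numbers in the R output come from, here is a minimal sketch in Python that recomputes the t statistic and the confidence interval by hand using only the standard library. The critical value 2.201 (two-sided 95%, 11 degrees of freedom) is hard-coded from a t-table rather than computed, and the variable names are illustrative, not part of the original analysis.

```python
import math
import statistics

# Same measurements as the R vector `level` above
level = [5.1, 5.2, 6, 7, 5.01, 5.0, 6.5, 5.6, 5.2, 6.1, 6.2, 5.0]
mu0 = 5.4  # alcohol level claimed on the label

n = len(level)
mean = statistics.mean(level)
se = statistics.stdev(level) / math.sqrt(n)  # standard error of the mean
t_stat = (mean - mu0) / se

# Assumption: two-sided 95% critical value of Student's t with
# n - 1 = 11 degrees of freedom, taken from a t-table.
T_CRIT = 2.201
ci_low = mean - T_CRIT * se
ci_high = mean + T_CRIT * se
reject = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.4f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"reject the 5.4% claim: {reject}")
```

Because the t statistic stays below the critical value, the sketch reaches the same conclusion as `t.test`: the sample does not give enough evidence against 5.4%.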

Enough with the statistical jargon? Okay let’s enjoy the beer

Police Data Challenge

Sun, 23 Jul 2017 21:13:14 -0500

Police Data Challenge: Winner Recommendations

February 1, 2018

The Police Data Challenge contest brought talented high school and undergraduate students across the nation to show their passion for the good statistics can do.

With the Police Foundation’s efforts to make the information available, the 70 teams used real crime data sets from Baltimore, Seattle and Cincinnati police departments to analyze the best possible solutions for safer communities.

Check out below how the winning teams analyzed the best way to fight crime through statistics:

Winona State University, Winona, MN: Jimmy Hickey, Kapil Khanal, and Luke Peacock divided the crimes into more detailed categories than the Seattle Police Department data provided. They used the crime types and locations to discover that gun-related crimes are concentrated in specific areas. Their recommendation was to raise public awareness of the times and locations with high crime and to add more police patrols.

Secretary Problem

Sun, 23 Jul 2017 21:13:14 -0500

When to give up? Exploration vs Exploitation

A lot of hardworking students don’t end up being selected for scholarships. I should know, because I lost 3 years trying.

Now I turn to an information-theoretic game to find out when I should have quit the whole process.

Assumption: your best score will get you the scholarship if you are a sufficiently prepared student.

Say the entrance exams are the game. We can all agree they behave like one. If a student is well prepared, as indicated by practice questions and exams, then getting their name on the scholarship list is basically a game of chance. This is not to say that it is impossible, but given that we all have time and money constraints in our lives, when is the right time to quit? Thus, a player in this game is a sufficiently prepared, hardworking student. Everyone else has to get sufficiently prepared before playing.

Now that we agree that getting your name on that list is a matter of chance, say you are prepared to take the entrance exam 10 times, but that comes at a cost of time and money. The 10 exams you take can be ranked from your best score to your worst, from 1 to 10. We can agree on one thing: your best score has the highest chance of getting you the scholarship (which may not be true for everyone, but our player is a smart, hardworking, well-prepared one).

Now, we take the exams one by one, and the score is random above some cutoff (for me it was 90). We can all relate to the “fact” that some questions are essentially random and they determine our fate.

So we don’t know which exam will give us our best score, and it is reasonable to assume that it is random. After each exam, we can certainly rank the exams taken so far from best to worst.

The optimal strategy is to take the first n/e exams without quitting, and then quit at the first later exam whose score is better than every score before it.

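import numpy as np
import matplotlib.pyplot as plt

def quit_candidate(n, reject=np.e):
    '''Choose an exam to quit after, from a list of n exams, using
    the optimal strategy. 1 = best time to quit, n = worst time to quit.
    reject is the percentage of exams taken before quitting is allowed;
    the default (np.e) means n/e exams.'''

    exams = np.arange(1, n+1)
    np.random.shuffle(exams)

    if reject == np.e:
        stop = int(round(n/np.e))
    else:
        stop = int(round(reject*n/100))
    best_from_rejected = np.min(exams[:stop])
    rest = exams[stop:]

    try:
        # first later exam that ranks better than everything seen so far
        return rest[rest < best_from_rejected][0]
    except IndexError:
        # nothing better turned up, so we are stuck with the last exam
        return exams[-1]

# Now let's see if it actually holds, by having 100,000 students take 100 exams each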

sim = np.array([quit_candidate(n=100) for i in range(100000)])

plt.figure(figsize=(10, 6))
plt.hist(sim, bins=100)
plt.xticks(np.arange(0, 101, 10))
plt.ylim(0, 40000)
plt.xlabel('Chosen candidate')
plt.ylabel('frequency')
plt.show()

img

We see that the single most frequent outcome is quitting at the prime time (rank 1 is the prime time to quit).

best_candidate = []
for r in range(5, 101, 5):
    sim = np.array([quit_candidate(n=100, reject=r) for i in range(100000)])
    # np.histogram counts frequency of each candidate
    best_candidate.append(np.histogram(sim, bins=100)[0][0]/100000)

plt.figure(figsize=(10, 6))
plt.scatter(range(5, 101, 5), best_candidate)
plt.xlim(0, 100)
plt.xticks(np.arange(0, 101, 10))
plt.ylim(0, 0.4)
plt.xlabel('% of candidates rejected')
plt.ylabel('Probability of choosing best candidate')
plt.grid(True)
plt.axvline(100/np.e, ls='--', c='black')
plt.show()

img

Hence, the optimal plan is to take about 37% of the exams you are prepared for and then quit at the first exam whose score beats every score before it.

So I was ready to take 8 exams, and my scores were [84,87,88,94,90,92].

37% of 8 ≈ 3.

My score was still improving after the 3rd exam (94 on the 4th), so I guess I was right to keep taking exams up to that point. But on the 5th exam my score went down; I guess I should have quit by then instead of taking one more exam. I lost another 3 months preparing for that.
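As a small sanity check, the rule can be applied directly to the scores listed above. This is a minimal sketch, not part of the original simulation: the helper name `first_to_beat_warmup` is made up for illustration, and here a higher score is better (the simulation above works with ranks, where lower is better).

```python
import math

def first_to_beat_warmup(scores, planned):
    """Return (exam_number, score) where the n/e rule says to stop, or None.

    scores  -- exam scores in the order they were taken (higher is better)
    planned -- number of exams the student was prepared to take
    """
    stop = int(round(planned / math.e))  # exams taken without quitting
    best_seen = max(scores[:stop])
    for number, score in enumerate(scores[stop:], start=stop + 1):
        if score > best_seen:
            return number, score
    return None  # no later exam beat the warm-up exams

scores = [84, 87, 88, 94, 90, 92]  # scores listed in the text
decision = first_to_beat_warmup(scores, planned=8)
print(decision)
```

With 8 planned exams the warm-up phase is the first 3 exams, and the first later score that beats all of them is the 94 on the 4th exam.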

:-by Kapil Khanal

Why do you have to wait more for the buses?

Sun, 23 Jul 2017 21:13:14 -0500

Average for group vs Individual

Inspection Paradox

Buses and trains are supposed to arrive at constant intervals, but in practice some intervals are longer than others. This means the buses do not follow the schedule exactly. There is always some randomness. With your luck, you might think you are more likely to arrive during a long interval. It turns out you are right: a random arrival is more likely to fall in a long interval because, well, it’s longer!

Let’s think of a scenario…

Suppose the bus service in your city says a bus passes a station every 10 minutes. If you show up at the station at a random time, you would expect to wait 5 minutes on average. But more often than not you will wait longer than five minutes; if the gaps between buses are completely random (exponentially distributed with a mean of 10 minutes), the average wait is actually 10 minutes.
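The bus scenario can be checked with a short simulation. This is only a sketch under stated assumptions: the "random" buses are modelled with exponentially distributed gaps averaging 10 minutes, passengers arrive uniformly at random, and the function and variable names are illustrative.

```python
import bisect
import random

def mean_wait(gaps, n_passengers, rng):
    """Average wait until the next bus for passengers arriving at random times."""
    arrivals = []
    t = 0.0
    for gap in gaps:
        t += gap
        arrivals.append(t)  # cumulative bus arrival times
    total = 0.0
    for _ in range(n_passengers):
        p = rng.uniform(0.0, arrivals[-1])                  # passenger shows up
        next_bus = arrivals[bisect.bisect_left(arrivals, p)]  # first bus at or after p
        total += next_bus - p
    return total / n_passengers

rng = random.Random(0)
n_buses = 50_000

regular_gaps = [10.0] * n_buses                                   # perfectly on schedule
random_gaps = [rng.expovariate(1 / 10.0) for _ in range(n_buses)]  # mean gap of 10 minutes

wait_regular = mean_wait(regular_gaps, 50_000, rng)
wait_random = mean_wait(random_gaps, 50_000, rng)
print(f"regular schedule: {wait_regular:.2f} min, random gaps: {wait_random:.2f} min")
```

Both services run a bus every 10 minutes on average, yet the simulated wait is about 5 minutes for the regular schedule and about 10 minutes for the random one.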

Another example of this paradox: most schools report their average class size. But for you, as a student, that average is not accurate. Say there are 4 classes of sizes 75, 13, 12, and 10. Then the average the college will report is $(75 + 13 + 12 + 10)/4 = 27.5$, but for you as a prospective student the average is different.

You are more likely to be in the room with 75 students, so each class has to be weighted by its size: $((75 \times 75) + (13 \times 13) + (12 \times 12) + (10 \times 10))/110 = 54.89$. Hence, the reported average is not for you. This kind of paradox happens everywhere.
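The two averages are easy to reproduce. A minimal sketch, with illustrative variable names:

```python
sizes = [75, 13, 12, 10]  # class sizes from the example

# What the school reports: every class counts once.
school_average = sum(sizes) / len(sizes)

# What a randomly chosen student experiences: a class of size s is
# picked with probability s / total, so each class is weighted by its size.
total_students = sum(sizes)
student_average = sum(s * s for s in sizes) / total_students

print(f"school's average class size:  {school_average}")
print(f"student's average class size: {student_average:.2f}")
```

The only difference between the two lines is the weighting: per class for the school, per student for the individual.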

To put it more abstractly, this is a case where the perspectives of the individual and the group differ. For the group, the average describes what happens, but for an individual that average does not describe the experience.

External Project

Wed, 27 Apr 2016 00:00:00 +0000

An example journal article

Tue, 01 Sep 2015 00:00:00 +0000

Click the Cite button above to demo the feature to enable visitors to import publication metadata into their reference management software.

Click the Slides button above to demo Academic’s Markdown slides feature.

Supplementary notes can be added here, including code and math.

An example conference paper

Mon, 01 Jul 2013 00:00:00 +0000

Click the Cite button above to demo the feature to enable visitors to import publication metadata into their reference management software.

Click the Slides button above to demo Academic’s Markdown slides feature.

Supplementary notes can be added here, including code and math.

Mon, 01 Jan 0001 00:00:00 +0000

Smoothing of Images and Edge Detection

              `The one with Criminal Behind the Gaussian Noise`

            KAPIL KHANAL, DANIEL LEW, NILIMA PANDEY

Problem Statement

In a parallel universe, the Winona Police Department came to us to identify the location of a criminal. The criminal was hiding behind Gaussian noise. We took three steps to help the department.

Identify whether or not the noise was Gaussian

Smooth the photo using different techniques

Locate the important edges in the photograph.


%matplotlib inline
import cv2
import numpy as np
from matplotlib import pyplot as plt
#plt.style.use('ggplot')
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import display

A histogram is a plot of the distribution of pixel intensities in an image. Here we focus on the RGB color space.

Calculating the histogram of an image is very useful because it gives an intuition for properties of the image such as its tonal range, contrast, and brightness.

image = cv2.imread("Unidentified.png")

def histogram(image):
    """Plot per-channel intensity histograms above the image itself."""
    fig, axs = plt.subplots(2, 1, figsize=(12, 11))
    channels = cv2.split(image)

    colors = ("b", "g", "r")  # OpenCV loads images in BGR order

    for (channel, c) in zip(channels, colors):
        hist = cv2.calcHist([channel], [0], None, [256], [0, 256])
        axs[0].plot(hist, color=c, linewidth=1.0)
    axs[1].imshow(image[:, :, ::-1])  # reverse BGR to RGB for matplotlib

histogram(image)

Gaussian Blur

We performed a Gaussian blur on the image. The blur removes some of the noise before the image is processed further. An appropriate sigma can be found by trial and error.

In Gaussian blurring, a Gaussian kernel is used to blur the image.
The cv2.GaussianBlur() function is used to blur the image.
The cv2.getGaussianKernel() function can be used to create a Gaussian kernel.
The width and height of the kernel should be specified, and both should be positive and odd.
The standard deviations in the X and Y directions, sigmaX and sigmaY, should also be specified.
If only sigmaX is specified, sigmaY is taken to be the same as sigmaX.
If both sigmaX and sigmaY are given as zero, they are calculated from the kernel size.
Gaussian blurring is highly effective at removing Gaussian noise from an image.
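To make the role of the kernel size and sigma concrete, here is a minimal pure-Python sketch of the textbook 1D Gaussian weights. The function name `gaussian_kernel_1d` is illustrative, and the sketch is not claimed to reproduce the exact values OpenCV returns; it only covers the sigma > 0 case.

```python
import math

def gaussian_kernel_1d(size, sigma):
    """Normalized 1D Gaussian weights for an odd kernel size and sigma > 0."""
    if size <= 0 or size % 2 == 0:
        raise ValueError("size must be positive and odd")
    if sigma <= 0:
        raise ValueError("this sketch only handles sigma > 0")
    center = size // 2
    raw = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(raw)
    return [w / total for w in raw]  # weights sum to 1, so brightness is preserved

kernel = gaussian_kernel_1d(5, 1.0)
print([round(w, 4) for w in kernel])
```

The weights are symmetric and largest at the center, so each blurred pixel is dominated by its nearest neighbours. A 2D kernel is the outer product of two such 1D kernels, and a larger sigma spreads the weight further from the center.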

Gaussian Noise

?cv2.GaussianBlur

def smoothed_gaussian(img,window_size,sigma,hist = True):
    """ Function called by interact """
    img = cv2.GaussianBlur(img, (window_size, window_size), sigma)
    if hist:
        histogram(img)
    return img
y = interact(smoothed_gaussian,img = fixed(image),window_size = [3,5,7],sigma = widgets.IntSlider(min=0,max=10,step=1,value=4) )


Median Blur

def smoothed_median(img ,window_size,hist = True):
    """ Function called by interact """
    img = cv2.medianBlur(img, window_size)
    if hist:
        histogram(img)
    return img
y = interact(smoothed_median,img = fixed(image),window_size  = widgets.IntSlider(min=1,max=10,step=2,value=3))


Convolution

Convolution is an important operation in signal and image processing. Convolution operates on two signals (in 1D) or two images (in 2D): you can think of one as the “input” signal (or image) and the other (called the kernel) as a “filter” on the input image, producing an output image. So convolution takes two images as input and produces a third as output.

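def smoothed_convolution(img,window_size,hist =True):
    """ Function called by interact """
    # box (mean) kernel: every entry is 1/window_size^2, so the weights sum to 1
    kernel = np.ones((window_size,window_size))/(window_size*window_size)
    img = cv2.filter2D(img, -1,kernel)
    if hist:
        histogram(img)
    return img
# min=1 keeps the kernel non-empty and avoids dividing by zero
y = interact(smoothed_convolution,img = fixed(image),window_size = widgets.IntSlider(min=1,max=10,step=1,value=3))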
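To show what the box kernel does to individual pixels, here is a minimal pure-Python sketch of a mean filter on a tiny grayscale "image". The function name `box_filter` and the sample data are illustrative. Unlike cv2.filter2D, this sketch does not pad the borders, so the output is smaller than the input.

```python
def box_filter(image, k):
    """Mean-filter a 2D list with a k x k kernel (valid region only, no padding)."""
    rows, cols = len(image), len(image[0])
    weight = 1.0 / (k * k)  # every kernel entry is 1/k^2
    out = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            acc = 0.0
            # slide the kernel over the k x k window whose top-left corner is (r, c)
            for dr in range(k):
                for dc in range(k):
                    acc += image[r + dr][c + dc] * weight
            out_row.append(acc)
        out.append(out_row)
    return out

# A flat image with one noisy pixel in the top-left corner
noisy = [
    [90, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
smoothed = box_filter(noisy, 3)
for row in smoothed:
    print([round(v, 2) for v in row])
```

The spike of 90 is spread over the one window that contains it (170/9, about 18.89), while windows that miss it stay at 10. That averaging is why a larger window removes more noise but also blurs more detail.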


Edge detection is one of the fundamental operations in image processing. It reduces the amount of data (pixels) to process while preserving the structure of the image.

def showEdges(img, blur, thresholds, blurType, window_size, sigma):
    """ Function called by interact """
    # Smooth first (only when the blur checkbox is ticked), then detect edges
    if blur:
        if blurType == 'Median':
            img = smoothed_median(img, window_size, hist=False)
        elif blurType == 'Gaussian':
            img = smoothed_gaussian(img, window_size, sigma, hist=False)
        elif blurType == 'Convolution':
            img = smoothed_convolution(img, window_size, hist=False)

    thresh1, thresh2 = thresholds
    edges = cv2.Canny(img, thresh1, thresh2)
    plt.imshow(edges, cmap='gray')

rangeSlider = widgets.IntRangeSlider(
    value = [50, 200],
    min = 0,
    max = 255,
    step = 1,
    description = 'Thresholds',
    continuous_update = True
)

y = interact(showEdges,
             img = fixed(image),
             blur = True,
             thresholds = rangeSlider,
             blurType = ['Median', 'Gaussian', 'Convolution'],
             window_size = [3,5,7,9],
             sigma = widgets.IntSlider(min=0,max=10,step=1,value=4))


Verifying empirical rule and Chebyshev's theorem

Mon, 01 Jan 0001 00:00:00 +0000

Empirical rule and Chebyshev’s theorem

Let’s talk about a really simple but powerful concept: data distributions. A data distribution is an abstract concept (a function) that gives the possible values of the data and how often each value is generated. When you want to talk about all the data from your experiments at once, talk about the data distribution. It gives us the probability that a given value will be the output if we keep repeating the experiment.

We rarely have the complete dataset from an experiment. So it is powerful to have an idea of how the data is distributed and which values occur more often than others. We can intuitively understand some distributions, like the heights of a population. We know there will be a few very short people and a few very tall people, but we are sure that most people will be in between. It is really convenient to know the spread and frequency of the data in advance.

Interestingly, there is more than one kind of distribution in the world, so it helps to know something about the spread of the data in advance, whatever the distribution is. There is a famous theorem that gives us an idea of how our data is spread out. It’s called Chebyshev’s theorem.

image credit: libretext

It says that at least 3/4 of our data will be within two standard deviations of the mean, whatever the distribution.

library(tidyverse)
library(knitr)
library(kableExtra)
stock<- read.csv("~/OneDrive - MNSCU/FALL 2019/MathStat/Data/Stock Trade.csv",stringsAsFactors = FALSE)

Now let’s clean up the column name:

stock<- stock %>% select(percentStock = X..of.Shares.Outstanding)

The empirical rule says that, for roughly normal data, about 68% of the data will be within one standard deviation of the mean, about 95% within two, and about 99.7% within three.

The function below:
1. standardizes the data
2. counts data within z standard deviations
3. outputs the proportion

data_within<- function(df, z){
  func_normalize<-function(x){(x-mean(x))/sd(x)}
  # drop the data point at or above 11 before standardizing
  df<-df %>% filter(percentStock<11)
  df_scaled<- df %>% mutate(percentStock_normal = func_normalize(percentStock)) %>% filter(abs(percentStock_normal)<z)
  proportion = dim(df_scaled)[1]/dim(df)[1]
  return (round(proportion,2))
}

Let’s collect the output in a small tibble.

tb<- tibble(
  first_std_dev = data_within(stock,1),
  second_std_dev = data_within(stock,2),
  third_std_dev = data_within(stock,3),
)

kable(tb)

first_std_dev	second_std_dev	third_std_dev
0.64	0.92	1

We can also test whether our function is working correctly:

library(testthat)


normal_generated = tibble(percentStock = rnorm(10,mean = 6.2,sd = 1.2))

#Testing our function
tb_test<- tibble(
  first_std_dev = data_within(normal_generated,1),
  second_std_dev = data_within(normal_generated,2),
  third_std_dev = data_within(normal_generated,3),
)


kable(tb_test)

first_std_dev	second_std_dev	third_std_dev
0.7	1	1

testthat::expect_gt(tb_test$first_std_dev, 0.68,label = "data proportion within first deviation")

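Hence, our function is working correctly. Note that the data is randomly generated every time the code is run, so with only 10 points the proportion within one standard deviation can occasionally fall below 0.68 and the check can fail.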
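As a cross-check that does not depend on a tiny sample, here is a minimal Python sketch of the same three steps (standardize, count within z standard deviations, output the proportion) on a large generated normal sample, compared against the Chebyshev bound 1 - 1/z². The function name `proportion_within` and the seed are illustrative; the mean and standard deviation match the `rnorm` call above.

```python
import random
import statistics

def proportion_within(values, z):
    """Share of values whose standardized distance from the mean is below z."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs((v - mean) / sd) < z)
    return inside / len(values)

rng = random.Random(42)
sample = [rng.gauss(6.2, 1.2) for _ in range(100_000)]

results = {z: proportion_within(sample, z) for z in (1, 2, 3)}
chebyshev = {z: 1 - 1 / z ** 2 for z in (1, 2, 3)}  # lower bound for any distribution

for z in (1, 2, 3):
    print(f"within {z} sd: {results[z]:.3f} (Chebyshev guarantees at least {chebyshev[z]:.3f})")
```

For normal data the observed proportions sit close to 68%, 95%, and 99.7%, well above the Chebyshev bounds of 0, 75%, and about 89%. The bound is weaker because it has to hold for every distribution, not just the normal one.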

𝚃𝚛𝚊𝚗𝚜𝚙𝚘𝚗𝚜𝚝𝚎𝚛

Example Page 1

Tip 1

Tip 2

Example Page 2

Tip 3

Tip 4

Example Talk

Sales Impact Analysis with Clustering and Causal effects

Minnesota Lake Project:ML in Production Exercise

So you have a good model? Want to make it available to serve the world?

Prototype-grade model workflow to Production land workflow

Model Predictions as WebService

Basic software engineering skills and gotchas:

Data Driven Public Policy

Minnesota Lake Project

CMU Sport Analytics Projects Slideshows

My CMSAC Experience

My First Project at CMU Statistics :Sport Analytics Camp

Project1: Baseball

Project 2: Tennis

Project 3:Simulating Office Environment in Analytics

Project 3:

Data Dashboard for StockX Contest

StockX Data Contest 2019

My Data Dashboard for StockX

It’s interesting how the demand of yeezy increased at around 90 weeks after the release of the shoes.

Winona Area Public Schools: Community Contribution

Winona Area Public Schools Data Visualization

Methods and Steps of Projects

Acknowledgement

Animation:Internet Usage

How internet is eating the world? Internet Usage animation

Sankey diagrams for Bacteria and antibiotics

Visually Classifying Bacteria and Antibiotics

Truck Factor

Truck Factor

An example preprint / working paper

Slides

Welcome to Slides

Features

Controls

Code Highlighting

Math

Fragments

Speaker Notes

Themes

Custom Slide

Custom CSS Example

Questions?

Watershed Quality in Minnesota

Testing Alcohol level

Is there really 5.4% alcohol in that beer brand?

Police Data Challenge

Police Data Challenge: Winner Recommendations

Secretary Problem

When to give up? Exploration vs Exploitation

Why do you have to wait more for the buses?

Average for group vs Individual

External Project

An example journal article

An example conference paper

Smoothing of Images and Edge Detection

Problem Statement

Gaussian Blur

Gaussian Noise

Median Blur

Convolution

Verifying empirical rule and Chebyshev's theorem

Empirical rule and Chebyshev’s theorem

It’s interesting how the demand of yeezy increased at around `90 weeks` after the release of the shoes.