<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Data Science | 𝚃𝚛𝚊𝚗𝚜𝚙𝚘𝚗𝚜𝚝𝚎𝚛</title>
    <link>https://almostkapil.netlify.com/tags/data-science/</link>
      <atom:link href="https://almostkapil.netlify.com/tags/data-science/index.xml" rel="self" type="application/rss+xml" />
    <description>Data Science</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© 2018 Kapil Khanal</copyright><lastBuildDate>Wed, 21 Aug 2019 21:13:14 -0500</lastBuildDate>
    <image>
      <url>https://almostkapil.netlify.com/img/aph-salt-spring-zoom.jpg</url>
      <title>Data Science</title>
      <link>https://almostkapil.netlify.com/tags/data-science/</link>
    </image>
    
    <item>
      <title>Data Dashboard for StockX Contest</title>
      <link>https://almostkapil.netlify.com/post/stockx/</link>
      <pubDate>Wed, 21 Aug 2019 21:13:14 -0500</pubDate>
      <guid>https://almostkapil.netlify.com/post/stockx/</guid>
      <description>


&lt;div id=&#34;stockx-data-contest-2019&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Stock&lt;span style=&#34;color:green&#34;&gt;&lt;b&gt;X&lt;/b&gt;&lt;/span&gt; Data Contest 2019&lt;/h2&gt;
&lt;p&gt;&lt;a href = &#34;https://stockx.com/news/the-2019-data-contest/&#34;&gt;StockX Challenge&lt;/a&gt; is a call for data and sneakers nerds to have fun.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://almostkapil.netlify.com/post/stockX_files/sneaker.jpg&#34; alt=&#34;source: stockX&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;source: stockX&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The basic idea is this: they give you a bunch of original StockX sneaker data, then you crunch the numbers and come up with the coolest, smartest, most compelling story you can tell. It can be literally anything you want. A theory, an insight, even just a really original data visualization. It could be a novel hypothesis about resale prices you’ve always wanted to test. Or maybe it’s just a beautiful chart to visualize the data. It can be on any subject – sneakers, brands, buyers, or even StockX itself. Whatever you find interesting, just follow your bliss.&lt;/p&gt;
&lt;p&gt;I also took a shot at coming up with something useful. Below is my finished data dashboard. &lt;br&gt;&lt;/p&gt;
&lt;div id=&#34;my-data-dashboard-for-stockx&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;My Data Dashboard for Stock&lt;span style=&#34;color:green&#34;&gt;&lt;b&gt;X&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://almostkapil.netlify.com/post/index_files/stockX.png&#34; alt=&#34;Dashboard&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Dashboard&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The Tableau worksheet is available &lt;a href = &#34;https://public.tableau.com/views/StockX_0/Dashboard1?:embed=y&amp;:display_count=yes&amp;:origin=viz_share_link&#34;&gt;here&lt;/a&gt;. &lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Calculations on the Dashboard&lt;/em&gt; &lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Price ratio&lt;/code&gt;: ratio of sale price to retail price for each sneaker. &lt;br&gt;
&lt;code&gt;Weeks&lt;/code&gt;: (Order Date - Release Date), converted to weeks.&lt;br&gt;
&lt;code&gt;Median Price ratio&lt;/code&gt;: the median is used to reduce the effect of the uneven date coverage (2017 and 2019 are not
as complete as 2018) and of the differing counts of sneaker sales.&lt;br&gt;
&lt;code&gt;Color&lt;/code&gt;: the color scale for the two brands is consistent across every plot that relates to brands.&lt;br&gt;&lt;/p&gt;
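&lt;p&gt;As a rough illustration of the calculated fields above, here is a minimal R sketch. The column names (&lt;code&gt;sale_price&lt;/code&gt;, &lt;code&gt;retail_price&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;release_date&lt;/code&gt;) and the two rows of values are made up for the example; the actual fields were calculated inside Tableau.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy data: column names and values are hypothetical, not from the StockX dataset
sales &amp;lt;- data.frame(
  sale_price   = c(1000, 330),
  retail_price = c(200, 220),
  order_date   = as.Date(c(&amp;quot;2018-03-01&amp;quot;, &amp;quot;2018-01-15&amp;quot;)),
  release_date = as.Date(c(&amp;quot;2018-01-04&amp;quot;, &amp;quot;2018-01-01&amp;quot;))
)

# Price ratio: sale price divided by retail price
sales$price_ratio &amp;lt;- sales$sale_price / sales$retail_price

# Weeks: days between order date and release date, converted to weeks
sales$weeks &amp;lt;- as.numeric(sales$order_date - sales$release_date) / 7

# Median price ratio, the summary statistic used on the dashboard
median_ratio &amp;lt;- median(sales$price_ratio)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A negative value of &lt;code&gt;weeks&lt;/code&gt; would correspond to an order placed before the release date.&lt;/p&gt;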
&lt;p&gt;&lt;b&gt;1) Orders of Sneakers by Brand, by Weeks from Release Date&lt;/b&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;This plot shows the total count of orders for the different sneakers of the two brands.
Sneakers of both brands are ordered before their release dates. Off-White has more orders than Yeezy in the dataset.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;its-interesting-how-the-demand-of-yeezy-increased-at-around-90-weeks-after-the-release-of-the-shoes.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;It’s interesting how the demand for Yeezy increased at around &lt;code&gt;90 weeks&lt;/code&gt; after the release of the shoes.&lt;/h2&gt;
&lt;p&gt;&lt;b&gt;2) Ratio of Sale Price to Retail Price for Each Brand by Weeks&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;This plot looks at how the ratio of sale price to retail price relates to the weeks after the release
date for each brand. Clearly, both brands’ sale prices are higher than their retail prices. The ratio for Off-White increases in
general regardless of the individual sneaker, while the ratio for Yeezy is somewhat noisy but follows
a trend similar to Off-White’s. For both brands, the price ratio increases after the release date.&lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;3) Distribution of Median Sale Price Given the Retail Price for Each Brand&lt;/b&gt;&lt;br&gt;
This plot looks in detail at how the median sale price is distributed for each sneaker. It plots the distribution of
median sale price for the top 28 sneakers that were sold at least 5 times over retail price.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;4) Median Price and States&lt;/b&gt; &lt;br&gt;&lt;/p&gt;
&lt;p&gt;This plot looks at the median price ratio for all the states. The color scale encodes the ratio, and
the size of each sneaker marker shows total sales relative to the other states. Which states usually pay more for
sneakers? Clearly, Delaware, Vermont, and Utah had some sales with a high price ratio. States like California and
New York have a lot of sales, as shown by their relative sizes. The relative size is calculated by taking the
log of total sales in each state. States like Wyoming have fewer sales and also a lower price ratio.&lt;/p&gt;
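&lt;p&gt;The two encodings in this plot can be sketched in R as follows. This is only an illustration: the rows are made up, and it assumes that total sales means the number of orders per state, which the dashboard may define differently.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy data: the numbers are hypothetical, not from the StockX dataset
orders &amp;lt;- data.frame(
  state       = c(&amp;quot;CA&amp;quot;, &amp;quot;CA&amp;quot;, &amp;quot;CA&amp;quot;, &amp;quot;WY&amp;quot;),
  price_ratio = c(2, 3, 4, 1.5)
)

# Median price ratio per state (drives the color scale)
median_ratio &amp;lt;- tapply(orders$price_ratio, orders$state, median)

# Number of sales per state on a log scale (drives the marker size)
log_sales &amp;lt;- log(table(orders$state))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Taking the log keeps states with very many sales from dwarfing the markers of states with few sales.&lt;/p&gt;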
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Winona Area Public Schools: Community Contribution</title>
      <link>https://almostkapil.netlify.com/post/waps/</link>
      <pubDate>Wed, 21 Aug 2019 21:13:14 -0500</pubDate>
      <guid>https://almostkapil.netlify.com/post/waps/</guid>
      <description>


&lt;div id=&#34;winona-area-public-schools-data-visualization&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;&lt;span style=&#34;color:purple&#34;&gt;Winona Area Public Schools Data Visualization &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;Introduction&lt;/code&gt;:&lt;br&gt;
This project addresses the need to communicate public school data to community members in a meaningful way. It also makes the data available to the general public in a proper and usable format. &lt;br&gt;&lt;/p&gt;
&lt;p&gt;There has been a wider discussion regarding the budget issue in Winona area schools. Here is
&lt;a href = &#34;https://www.winonadailynews.com/news/local/what-will-waps-cut-board-to-weigh-new-options-for/article_23c25b9f-7365-5aa2-b370-1ed251eb8231.html&#34;
 width=&#34;645&#34; height=&#34;955&#34;&gt;the article &lt;/a&gt;&lt;/p&gt;
Primarily, this project focused on cleaning and visualizing the Enrollment, Expenditures, and Staffing History reports of the Winona Area Public Schools (WAPS) district, available publicly through the Minnesota Department of Education Data Center.
Link: &lt;a href=&#34;http://education.state.mn.us/MDE/Data/&#34; class=&#34;uri&#34;&gt;http://education.state.mn.us/MDE/Data/&lt;/a&gt;
&lt;img src=&#34;https://almostkapil.netlify.com/post/WAPS_files/waps.png&#34; /&gt; &lt;br&gt;
&lt;h5&gt;
Methods and Steps of Projects
&lt;/h5&gt;
&lt;p&gt;1) Data Inspection/Acquisition:&lt;br&gt;
Public data was collected by Alison Quam (representative from the WAPS district).
The data were made available in various PDF and Excel files, and the information was scattered across them.&lt;br&gt;&lt;/p&gt;
&lt;p&gt;2) Data Cleaning and Formatting&lt;br&gt;
First, most of the PDF files were converted to Excel using Tabula (Link: &lt;a href=&#34;http://tabula.technology/&#34; class=&#34;uri&#34;&gt;http://tabula.technology/&lt;/a&gt;) and an online tool (&lt;a href=&#34;http://pdftoexcel.com&#34; class=&#34;uri&#34;&gt;http://pdftoexcel.com&lt;/a&gt;).
Then they were cleaned into a proper format and stacked using Python (pandas).&lt;br&gt;&lt;/p&gt;
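&lt;p&gt;The original cleaning code was written in Python (pandas) and is not reproduced here. Purely to illustrate what stacking means, here is a small R sketch; the two tables, their column names, and their values are hypothetical stand-ins for the converted Excel sheets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical per-year tables standing in for the converted Excel sheets
enroll_2017 &amp;lt;- data.frame(school = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;), enrollment = c(300, 250))
enroll_2018 &amp;lt;- data.frame(school = c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;), enrollment = c(290, 260))

# Tag each table with its year so the source is not lost after stacking
enroll_2017$year &amp;lt;- 2017
enroll_2018$year &amp;lt;- 2018

# Stack the tables row-wise into one long table
enrollment &amp;lt;- rbind(enroll_2017, enroll_2018)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A single long table like this is the shape Tableau works with most easily.&lt;/p&gt;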
&lt;p&gt;3) Data Exploration and Visualization &lt;br&gt;
This part of the project focused on addressing the questions provided by the WAPS representative (Alison Quam).
Tableau was used extensively to explore and visualize the data.
Primarily, I focused on answering the following questions.&lt;br&gt;
1. &lt;span style=&#34;color:purple&#34;&gt;&lt;strong&gt;I was curious about how the enrollment and the capture rate (the rate of newborns later enrolling in kindergarten) are changing in the WAPS district.&lt;/strong&gt; &lt;/span&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;After a few meetings with the representative, I realized she was more curious about how the schools spend across different programs.&lt;br&gt;&lt;/p&gt;
&lt;p&gt;2. &lt;span style=&#34;color:purple&#34;&gt;&lt;strong&gt;How are the expenditure per average daily membership (count of students served daily in schools) and the spending in various categories changing?&lt;/strong&gt;&lt;/span&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;The Tableau file and the data are available &lt;a href = &#34;https://public.tableau.com/views/WinonaAreaPublicSchoolsDataStory/FourthDashboard?:retry=yes&amp;:embed=y&amp;:display_count=yes&amp;:origin=viz_share_link&#34;&gt; &lt;b&gt;here&lt;/b&gt; &lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now, Visual Story Begins….&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://almostkapil.netlify.com/post/WAPS_files/Second_Dashboard.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://almostkapil.netlify.com/post/WAPS_files/Third_Dashboard.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://almostkapil.netlify.com/post/WAPS_files/Fourth_Dashboard.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;This project actually helped inform decision makers at the local level. Thus, I was able to contribute to something meaningful with my Python and Tableau skills.&lt;/code&gt;&lt;/p&gt;
&lt;div id=&#34;acknowledgement&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Acknowledgement&lt;/h4&gt;
&lt;p&gt;I would like to thank the WAPS representative and Prof. Silas Bergen for helping and guiding me to understand the terms and calculations already done in the reports, and Prof. Todd Iverson for helping me figure out the Python code for cleaning the data.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Animation:Internet Usage</title>
      <link>https://almostkapil.netlify.com/post/internetusage/</link>
      <pubDate>Thu, 15 Aug 2019 21:13:14 -0500</pubDate>
      <guid>https://almostkapil.netlify.com/post/internetusage/</guid>
      <description>


&lt;div id=&#34;how-internet-is-eating-the-world-internet-usage-animation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How internet is eating the world? Internet Usage animation&lt;/h2&gt;
&lt;p&gt;Internet usage is a World Bank development indicator. In this project I used the World Bank dataset (which is in the link provided below).&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://almostkapil.netlify.com/post/internetusage_files/internetUsage.gif&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Link to the tableau &lt;a href = &#34;https://public.tableau.com/shared/NXKC4HKX7?:display_count=yes&amp;:origin=viz_share_link&#34;&gt;worksheet&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Testing Alcohol level</title>
      <link>https://almostkapil.netlify.com/post/beer/</link>
      <pubDate>Wed, 23 May 2018 21:13:14 -0500</pubDate>
      <guid>https://almostkapil.netlify.com/post/beer/</guid>
      <description>


&lt;div id=&#34;is-there-really-5.4-alcohol-in-that-beer-brand&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Is there really 5.4% alcohol in that beer brand?&lt;/h2&gt;
&lt;p&gt;Many brands print on their label that the alcohol level is 5.4%. Let’s say we collected the alcohol-by-volume percentage for such a brand: we sampled beers randomly and measured the alcohol level ourselves.&lt;/p&gt;
&lt;p&gt;So we believe that the actual alcohol percentage should be 5.4%, but as beer consumers we sometimes feel it’s not.&lt;/p&gt;
&lt;p&gt;If we measured one beer and found that it has 6.7%, we would immediately complain that the brand is lying to us about the 5.4%. They may argue that our measuring apparatus or technique is not 100% accurate. There is no way of finding out how inaccurate our measurement is without measuring multiple times or measuring multiple beers. It might be the case that our measurement is 100% accurate and the beer has more alcohol than the company says. We don’t really know. Also, we can’t measure every single beer they have ever manufactured.
This is the perfect time to put our statistical sense to the test.
Below we have a list of measurements from different beers bought at random, some from midtown, some from Walmart.
Let’s do a t-test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;level = c(5.1,5.2,6,7,5.01,5.0,6.5,5.6,5.2,6.1,6.2,5.0)
t.test(level, mu = 5.4)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  One Sample t-test
## 
## data:  level
## t = 1.3139, df = 11, p-value = 0.2156
## alternative hypothesis: true mean is not equal to 5.4
## 95 percent confidence interval:
##  5.225010 6.093323
## sample estimates:
## mean of x 
##  5.659167&lt;/code&gt;&lt;/pre&gt;
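&lt;p&gt;The p-value (0.2156) is greater than 0.05, and the 95% confidence interval is [5.23, 6.09], which contains the claimed mean of 5.4. So we fail to reject the claim that the true mean alcohol level is 5.4%. The interval means that if 100 people repeated this random sampling of beer and each calculated a 95% confidence interval, about 95 of those intervals would be expected to contain the true mean.&lt;/p&gt;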
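&lt;p&gt;To see where these numbers come from, here is a sketch that recomputes the test statistic, p-value, and confidence interval by hand from the same measurements. The variable names (&lt;code&gt;mu0&lt;/code&gt;, &lt;code&gt;se&lt;/code&gt;, &lt;code&gt;t_stat&lt;/code&gt; and so on) are only for illustration; the results should agree with the &lt;code&gt;t.test&lt;/code&gt; output above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;level &amp;lt;- c(5.1,5.2,6,7,5.01,5.0,6.5,5.6,5.2,6.1,6.2,5.0)
mu0 &amp;lt;- 5.4                      # alcohol level claimed on the label
n   &amp;lt;- length(level)
se  &amp;lt;- sd(level) / sqrt(n)      # standard error of the sample mean

# Test statistic: distance of the sample mean from mu0 in standard errors
t_stat &amp;lt;- (mean(level) - mu0) / se

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value &amp;lt;- 2 * pt(-abs(t_stat), df = n - 1)

# 95% confidence interval for the true mean
ci &amp;lt;- mean(level) + c(-1, 1) * qt(0.975, df = n - 1) * se

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4)&lt;/code&gt;&lt;/pre&gt;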
&lt;p&gt;Enough with the statistical jargon? Okay let’s enjoy the beer&lt;img src=&#34;https://almostkapil.netlify.com/post/Beer_files/giphy.gif&#34; alt=&#34;Cold Beer and Confidence Interval&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Verifying empirical rule and Chebyshev&#39;s theorem</title>
      <link>https://almostkapil.netlify.com/post/untitled/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>https://almostkapil.netlify.com/post/untitled/</guid>
      <description>


&lt;div id=&#34;empirical-rule-and-chebyshevs-theorem&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Empirical rule and Chebyshev’s theorem&lt;/h2&gt;
&lt;p&gt;Let’s talk about a really simple but powerful concept: &lt;code&gt;Data Distributions&lt;/code&gt;. A data distribution is an abstract concept (a function) that gives the possible values of the data and how often each value is generated. When you want to talk about all the data of your experiment at once, talk about its data distribution. A data distribution gives us the probability of each value being an output if we keep repeating the experiment.&lt;/p&gt;
&lt;p&gt;We rarely have the complete dataset from an experiment, so it is powerful to have an idea of how the data is distributed and which values occur more often than others. We can intuitively understand some distributions, like the height of a population. We know there will be a few very short people and a few very tall people, but we are sure that most people will be in between. It is really convenient to know the spread and frequency of the data in advance.
&lt;img src=&#34;https://almostkapil.netlify.com/post/Untitled_files/normal.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, there is more than one kind of distribution in the world, so it helps to know something about the spread of the data in advance, whatever its distribution. There is a famous theorem that gives us an idea of how our data is spread. It’s called Chebyshev’s theorem.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;https://almostkapil.netlify.com/post/Untitled_files/chebyshev.jpg&#34; alt=&#34;image credit: libretext&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;image credit: libretext&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;It says that, for any distribution, at least 3/4 of our data will be within two standard deviations of the mean.&lt;/p&gt;
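&lt;p&gt;More generally, Chebyshev’s theorem guarantees that at least 1 - 1/k^2 of the data lies within k standard deviations of the mean, for any k greater than 1. A quick sketch of the bound (the variable names are only for illustration): for k = 2 it gives 0.75, the 3/4 mentioned above, and for k = 3 it gives 8/9.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;k &amp;lt;- c(2, 3)

# Minimum proportion of data within k standard deviations of the mean
chebyshev_bound &amp;lt;- 1 - 1 / k^2

chebyshev_bound&lt;/code&gt;&lt;/pre&gt;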
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(knitr)
library(kableExtra)
stock&amp;lt;- read.csv(&amp;quot;~/OneDrive - MNSCU/FALL 2019/MathStat/Data/Stock Trade.csv&amp;quot;,stringsAsFactors = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s clean the name,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;stock&amp;lt;- stock %&amp;gt;% select(percentStock = X..of.Shares.Outstanding)&lt;/code&gt;&lt;/pre&gt;
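&lt;p&gt;The empirical rule says that, for approximately normal data, about 68% of the data will be within one standard deviation of the mean, about 95% within two, and about 99.7% within three.&lt;/p&gt;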
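&lt;p&gt;For a normal distribution, the proportions within one, two, and three standard deviations of the mean can be computed directly from the normal CDF. A short sketch using base R’s &lt;code&gt;pnorm&lt;/code&gt; (the variable names are only for illustration) gives roughly 0.683, 0.954, and 0.997:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;z &amp;lt;- 1:3

# P(|Z| &amp;lt; z) for a standard normal variable Z
normal_within &amp;lt;- pnorm(z) - pnorm(-z)

round(normal_within, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are well above what Chebyshev’s theorem guarantees, because Chebyshev’s bound has to hold for every distribution, not just the normal one.&lt;/p&gt;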
&lt;p&gt;This function below:
1. standardizes the data &lt;br&gt;
2. counts data within &lt;code&gt;z&lt;/code&gt; standard deviations &lt;br&gt;
3. outputs the proportion&lt;/p&gt;
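&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Proportion of percentStock values within z standard deviations of the mean
data_within &amp;lt;- function(df, z){
  # z-score: center on the mean, then scale by the standard deviation
  func_normalize &amp;lt;- function(x){(x - mean(x)) / sd(x)}
  # keep values below 11 (this removes one extreme data point)
  df &amp;lt;- df %&amp;gt;% filter(percentStock &amp;lt; 11)
  # keep only the rows whose z-score is within z standard deviations
  df_scaled &amp;lt;- df %&amp;gt;%
    mutate(percentStock_normal = func_normalize(percentStock)) %&amp;gt;%
    filter(abs(percentStock_normal) &amp;lt; z)
  proportion &amp;lt;- nrow(df_scaled) / nrow(df)
  return(round(proportion, 2))
}&lt;/code&gt;&lt;/pre&gt;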
&lt;p&gt;Let’s collect the output in a small tibble.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tb&amp;lt;- tibble(
  first_std_dev = data_within(stock,1),
  second_std_dev = data_within(stock,2),
  third_std_dev = data_within(stock,3),
)

kable(tb)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
first_std_dev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
second_std_dev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
third_std_dev
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.64
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.92
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can also test if our function is working correctly,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(testthat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;testthat&amp;#39; was built under R version 3.5.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;normal_generated = tibble(percentStock = rnorm(10,mean = 6.2,sd = 1.2))

#Testing our function
tb_test&amp;lt;- tibble(
  first_std_dev = data_within(normal_generated,1),
  second_std_dev = data_within(normal_generated,2),
  third_std_dev = data_within(normal_generated,3),
)


kable(tb_test)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
first_std_dev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
second_std_dev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
third_std_dev
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.7
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;testthat::expect_gt(tb_test$first_std_dev, 0.68,label = &amp;quot;data proportion within first deviation&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
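&lt;p&gt;The check passes for this sample, which suggests our function is working correctly. Note that the data is randomly generated every time the code is run, and with only 10 points the proportion within one standard deviation can fall below 0.68 by chance, so this check may occasionally fail.&lt;/p&gt;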
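&lt;p&gt;One way to make such a check more stable is to fix the random seed and use a much larger sample. The sketch below is an assumption-laden variant, not the code used above: &lt;code&gt;prop_within&lt;/code&gt; is a hypothetical base R helper that works on a plain numeric vector, and the seed, sample size, and tolerance of 0.01 are arbitrary choices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical helper: proportion of x within z standard deviations of its mean
prop_within &amp;lt;- function(x, z) {
  mean(abs((x - mean(x)) / sd(x)) &amp;lt; z)
}

set.seed(2019)                               # fixed seed makes the run reproducible
x &amp;lt;- rnorm(100000, mean = 6.2, sd = 1.2)  # large sample from a normal distribution

# The proportions should be close to the empirical rule values
stopifnot(
  abs(prop_within(x, 1) - 0.683) &amp;lt; 0.01,
  abs(prop_within(x, 2) - 0.954) &amp;lt; 0.01,
  abs(prop_within(x, 3) - 0.997) &amp;lt; 0.01
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Checking against a tolerance on both sides, instead of only testing for a value greater than 0.68, also catches a function that returns proportions that are too large.&lt;/p&gt;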
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
