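Until now I have scraped through a few assignments in R by search-engine-driven programming. Although I had put together some notes on R syntax and commonly used packages, I still felt lost whenever I met a new data set. This semester's assignments set a higher bar, so I plan to write the process down in the order I actually carry it out, and build up a set of basic data-handling routines.
As the course went on, I found that many chapters rest on concepts, coping strategies, sample-handling methods and statistical methods that need to be pinned down. Rather than summarise them too hastily here, I would rather leave a link for further reading so they are easy to look up later.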
- The questions being asked about the data are the trail to follow in your analysis.
- Check the data in the background before moving on to more visualisation.
- Always show your raw data to your reader.
- Use “PackageName::function_name” to avoid ambiguity and show exactly which function is meant.
- When the data set is large, you can’t explore every row at a glance, so you need these methods to guide the task.
Import Data
Load the CSV file with base R's read.csv() or readr's read_csv().
Base R vs readr (part of the tidyverse)
Commonly used parameters
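As a minimal sketch of the import step, the snippet below writes a tiny made-up CSV to a temporary file and reads it back. The file contents, the column names and the choice of `-99` as a missing-value marker are assumptions for illustration only; the readr call runs only if that package is installed.

```r
# Write a tiny, made-up CSV to a temp file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,value", "1,a,3.5", "2,b,NA", "3,a,-99"), csv_path)

# Base R: header, separator, NA markers and string handling are the usual knobs.
d_base <- utils::read.csv(
  csv_path,
  header = TRUE,
  sep = ",",
  na.strings = c("NA", "-99"),   # treat -99 as missing (assumed convention)
  stringsAsFactors = FALSE       # keep text columns as character
)

# readr equivalent, run only if the package is available.
if (requireNamespace("readr", quietly = TRUE)) {
  d_readr <- readr::read_csv(csv_path, na = c("NA", "-99"),
                             col_types = readr::cols())
}

print(d_base)
```

Declaring the missing-value markers at import time keeps sentinel values such as `-99` out of later summaries.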
Understand Data structure
summary or a glimpse
Provide some summary statistics about the data set with summary() or tibble::glimpse().
![image-20200226091122857](/Users/neil/Library/Application Support/typora-user-images/image-20200226091122857.png)
data.frame vs tibble
Print my_tb directly
str()
summary()
glimpse()
skimr::skim()
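The functions listed above can be compared side by side. This sketch assumes the built-in `iris` data set as a stand-in for real data; the tibble and skimr calls are guarded so the base R part runs on its own.

```r
# Built-in data set used purely for illustration.
my_df <- datasets::iris

# Printing a whole data.frame dumps every row; head() keeps it short.
print(utils::head(my_df))

# str(): one line per column with its type and first few values.
utils::str(my_df)

# summary(): quartiles and mean for numeric columns, counts for factors.
print(summary(my_df))

# Optional tidyverse views, run only if the packages are installed.
if (requireNamespace("tibble", quietly = TRUE)) {
  my_tb <- tibble::as_tibble(my_df)
  print(my_tb)            # a tibble prints only the first rows plus column types
  tibble::glimpse(my_tb)  # transposed view: one column per line
}
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(my_df))
}
```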
hist() for numeric data: does anything look odd?
Association/correlation leads to the pairs plot, which is discussed further on.
We can also build our own visualisation.
Data roles
What does each column mean?
The problem of data
Missing Data
- Are there missing values?
- Is there a pattern to the missing values?
- Mostly missing (50%, 60% or 80%: there is no standard, it depends on the questions being asked), and you can’t simply throw the data away
- Are there variables that are almost totally missing?
- Are there observations that are almost totally missing?
vis_miss # you can easily cluster by observation, and don’t forget to sort the variables
vis_dat
colMeans(is.na(d))
Cluster your missingness to give a better understanding of the pattern.
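A minimal sketch of the missingness checks above, using a small made-up data frame `d`. The 50% cutoff is an assumption chosen only for illustration, since the notes say there is no standard threshold; the `visdat` call is guarded in case the package is not installed.

```r
# Small made-up data frame with deliberate gaps.
d <- data.frame(
  a = c(1, NA, 3, NA, 5),
  b = c(NA, NA, NA, NA, 1),
  c = 1:5
)

# Share of missing values per variable and per observation.
col_missing <- colMeans(is.na(d))
row_missing <- rowMeans(is.na(d))

# Sort variables so the worst offenders come first.
print(sort(col_missing, decreasing = TRUE))

# Variables above an (assumed) 50% missingness cutoff.
mostly_missing_vars <- names(col_missing)[col_missing > 0.5]
print(mostly_missing_vars)

# Visual check: cluster observations and sort variables by missingness.
if (requireNamespace("visdat", quietly = TRUE)) {
  p <- visdat::vis_miss(d, cluster = TRUE, sort_miss = TRUE)
}
```

The same `rowMeans(is.na(d))` vector can be used to flag observations that are almost totally missing.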
Novelty variables
Correlation
For categorical/nominal data
mosaic
Show categorical data; table() gives the counts of values.
give a chance to your reader to handle the
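For two categorical variables, a contingency table and a mosaic plot can be sketched as follows. The built-in `mtcars` data set and the choice of `cyl` and `gear` are assumptions for illustration.

```r
# Cross-tabulate two categorical variables from a built-in data set.
tab <- table(mtcars$cyl, mtcars$gear, dnn = c("cyl", "gear"))
print(tab)

# Tile areas in a mosaic plot are proportional to the cell counts.
graphics::mosaicplot(tab, main = "cyl vs gear", color = TRUE)
```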
for numeric data
Pairs Plot
to explore the correlations
A pairs plot can also show categorical data
Corrgram
![image-20200301201738547](/Users/neil/Library/Application Support/typora-user-images/image-20200301201738547.png)
The ordering of the variables is only an ordering; it doesn’t change the correlations.
The method of calculation matters more: Spearman is rank-based and so is less affected by outliers, which means it can produce different groupings from Pearson.
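The difference between the two methods is easy to see on made-up data with one extreme value. This is only a sketch; the vectors `x` and `y` are invented for illustration.

```r
# Monotone relationship with one extreme value at the end.
x <- 1:10
y <- c(1:9, 100)

pearson  <- cor(x, y, method = "pearson")   # pulled down by the extreme point
spearman <- cor(x, y, method = "spearman")  # uses ranks only

print(c(pearson = pearson, spearman = spearman))
```

Because the ranks of `y` are unchanged by how extreme the last value is, the Spearman coefficient stays at 1 while the Pearson coefficient drops.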
boxplot
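Points beyond the whiskers might be outliers/novelties; the plot doesn’t prove anything, only that some points could be outliers.
We could standardise the variables so that all of them can be read on one scale.
IQR multiplier (the rule assumes roughly normally distributed data), so try to provide a widget that lets your readers adjust it.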
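The IQR rule behind the boxplot whiskers can be sketched as a small function. The name `iqr_outliers` and the sample vector are hypothetical; `multiplier` is the value a reader-facing widget would control.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k defaults to 1.5.
iqr_outliers <- function(x, multiplier = 1.5) {
  q <- unname(stats::quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < q[1] - multiplier * iqr | x > q[2] + multiplier * iqr
}

x <- c(10, 12, 11, 13, 12, 11, 50)

# Default multiplier flags the extreme value.
print(x[iqr_outliers(x)])

# A much larger multiplier flags nothing.
print(sum(iqr_outliers(x, multiplier = 30)))
```

A flagged point is only a candidate; whether it is a real outlier still depends on the question being asked.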
Techniques for data outliers
Scatter plot
plot
Continuity
Rising-order chart: leave the scale and center options to the reader, so they can see the real data if they want to.
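One way to read the rising-order chart is as a plot of sorted values, with centering and scaling as switches. The sketch below assumes that interpretation; `rising_values` is a hypothetical helper and `mtcars$mpg` is only a stand-in variable.

```r
# Sort a numeric vector after optional centering and scaling.
rising_values <- function(x, center = TRUE, scale = TRUE) {
  z <- as.numeric(base::scale(x, center = center, scale = scale))
  sort(z)
}

# Raw data in rising order (both switches off).
raw_sorted <- rising_values(mtcars$mpg, center = FALSE, scale = FALSE)

# Standardised data in rising order (both switches on).
std_sorted <- rising_values(mtcars$mpg)

# Jumps between neighbouring points hint at gaps in continuity.
plot(std_sorted, type = "b", xlab = "rank", ylab = "standardised mpg")
```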
data can be explained
Problems with nominals
Class imbalance
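A quick check for class imbalance is to look at class counts and proportions. The label vector below is made up for illustration.

```r
# Made-up labels: 95 "no" against 5 "yes".
labels <- c(rep("no", 95), rep("yes", 5))

counts <- table(labels)
props  <- prop.table(counts)

# Ratio of the largest class to the smallest class.
imbalance_ratio <- max(counts) / min(counts)

print(counts)
print(props)
print(imbalance_ratio)
```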
Why do visualisation
The human brain is a remarkably powerful machine; for example, we can spot a linear pattern as soon as we see a scatter plot.
Interaction
reactive in Shiny
ggplotly
any dataset can be melted
require(reshape2, quietly = TRUE)
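Melting turns a wide table into a long one with one row per id and measurement. This is a minimal sketch with a made-up data frame `wide`; the column names `measure` and `value` are assumptions, and the call is skipped if reshape2 is not installed.

```r
# Made-up wide table: one row per id, one column per measurement.
wide <- data.frame(
  id     = 1:3,
  height = c(1.6, 1.7, 1.8),
  weight = c(55, 65, 75)
)

# Long form: one row per (id, measurement) pair.
if (requireNamespace("reshape2", quietly = TRUE)) {
  long <- reshape2::melt(
    wide,
    id.vars       = "id",
    variable.name = "measure",
    value.name    = "value"
  )
  print(long)
}
```

The long form is convenient for plotting several variables in one chart, since the variable name becomes an ordinary column.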
Shiny
Debugging support is poor, so work in small steps: make a small change, run it, and test.
Set a breakpoint or call browser(); in the current version of RStudio they are implemented the same way, so a breakpoint is an easy way to debug.
reactlog can offer a visualised graph of the reactive dependencies.
Feature selection/engineering (Dimensionality)
Reference