- The questions being asked about the data are the thread you follow through your analysis.
- Check the data in the background before moving on to more visualization.
- Always show your raw data to your reader.
- Use `PackageName::function_name` to reduce ambiguity and show exactly which function is meant.
- When the data set is large, you can’t explore every row at a glance; you need the methods below to guide the task.
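As a small illustration of the `PackageName::function_name` form from the list above, the sketch below uses only packages that ship with R. The vector `x` is made up for the example.

```r
# Hypothetical toy vector, only for illustration.
x <- c(2, 4, 4, 10)

# Fully qualified calls make the source package explicit,
# even if another attached package masks the same function name.
med <- stats::median(x)
first_rows <- utils::head(datasets::airquality, 3)

print(med)
print(first_rows)
```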
Import Data
Load the CSV file using something like `read.csv()` (base R) or `readr::read_csv()` (readr, part of the tidyverse).
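A minimal sketch of both import routes. The CSV content and column names are invented so the example is self-contained, and the readr call only runs if that package happens to be installed.

```r
# Write a tiny, made-up CSV to a temporary file (illustrative data only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,value", "1,a,3.5", "2,b,NA", "3,a,7.1"), csv_path)

# Base R: returns a data.frame.
df_base <- utils::read.csv(csv_path, stringsAsFactors = FALSE)
print(class(df_base))

# readr (tidyverse): returns a tibble and reports the guessed column types.
if (requireNamespace("readr", quietly = TRUE)) {
  df_readr <- suppressMessages(readr::read_csv(csv_path))
  print(class(df_readr))
}
```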
Understand Data structure
Provide some summary statistics about the data set with `summary()` or `tibble::glimpse()`.
![image-20200226091122857](/Users/neil/Library/Application Support/typora-user-images/image-20200226091122857.png)
data.frame vs tibble
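A short sketch of these structure checks, assuming the built-in `airquality` data set as a stand-in for your own data. The tibble part only runs if the tibble package is installed; it shows one practical difference between the two classes, namely what single-column subsetting returns.

```r
df <- datasets::airquality  # built-in data.frame, used here as a stand-in

print(summary(df))  # per-column summary statistics, including NA counts
str(df)             # base R view of column types and first values

if (requireNamespace("tibble", quietly = TRUE)) {
  tibble::glimpse(df)          # compact, one line per column
  tb <- tibble::as_tibble(df)
  # A data.frame drops to a vector here; a tibble stays a tibble.
  print(class(df[, "Ozone"]))
  print(class(tb[, "Ozone"]))
}
```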
`hist()` for numeric data: does anything look odd?
Association/correlation leads to the pairs plot; more on this later.
We can build our own visualizations.
Data roles
What does each column mean?
Problems with the data
Missing Data
- Are there missing values?
- Is there a pattern to the missing values?
- Some variables may be mostly missing (50%, 60% or 80%; there is no standard threshold, it depends on the question), yet you can’t simply throw them away.
- Are there variables that are almost totally missing?
- Are there observations that are almost totally missing?
`vis_miss()`: you can easily cluster by observation, and don’t forget to sort the variables.
Cluster the missingness to give a better understanding of its pattern.
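The missing-data questions above can be sketched in base R, with the plot as an optional extra. This assumes the built-in `airquality` data set as example data and assumes `vis_miss()` comes from the visdat package; the plot only runs if visdat is installed.

```r
df <- datasets::airquality  # built-in data with some missing values

# Proportion missing per variable, sorted so the worst variables come first.
miss_by_var <- sort(colMeans(is.na(df)), decreasing = TRUE)
# Proportion missing per observation (row).
miss_by_obs <- rowMeans(is.na(df))

print(round(miss_by_var, 3))
print(table(mostly_missing = miss_by_obs > 0.5))

# Optional: missingness map, rows clustered and columns sorted by missingness.
if (requireNamespace("visdat", quietly = TRUE)) {
  p <- visdat::vis_miss(df, cluster = TRUE, sort_miss = TRUE)
  print(p)
}
```

The 0.5 cut-off is only an example; as noted above, there is no standard threshold.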
Novelty variables
for categorical/nominal data
Show categorical data with `table()` to see counts of values.
give a chance to your reader to handle the
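A minimal sketch of counting values with `table()`. The `colour` vector is invented for illustration and includes a rare level and a missing value.

```r
# Made-up nominal variable with one rare level and one NA.
colour <- c("red", "blue", "red", "green", "red", NA, "blue", "red")

# useNA = "ifany" keeps missing values visible in the counts.
counts <- table(colour, useNA = "ifany")

print(sort(counts, decreasing = TRUE))  # most frequent levels first
print(round(prop.table(counts), 2))     # the same counts as proportions
```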
for numeric data
Pairs Plot
to explore the correlations
A pairs plot can also show categorical data.
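One possible way to draw such a plot, assuming the built-in `iris` data set as example data. The base R version colours points by the categorical column; the GGally version is an optional alternative that only runs if that package is installed.

```r
df <- datasets::iris

# Base R pairs plot of the numeric columns, coloured by the categorical Species.
graphics::pairs(df[, 1:4],
                col = as.integer(df$Species), pch = 19,
                main = "iris: pairs plot coloured by Species")

# Optional: ggpairs() includes the categorical column as its own panel.
if (requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggpairs(df, mapping = ggplot2::aes(colour = Species)))
}
```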
![image-20200301201738547](/Users/neil/Library/Application Support/typora-user-images/image-20200301201738547.png)
Order is just order; it doesn’t change the correlation.
The method of calculation matters more: Spearman correlation is less affected by outliers, so Spearman can produce different groupings.
Flagged points might be outliers/novelties; this doesn’t prove anything, only that some point could be an outlier.
We could standardize the variables so they can all be read on the same scale.
IQR multiplier (only appropriate for roughly normally distributed data), so try to provide a widget that lets your readers adjust it.
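A small sketch of how the correlation method reacts to a single outlier. The vectors are made up: `y` increases with `x`, but its last value is extreme.

```r
# Made-up data: monotonic relationship with one extreme value at the end.
x <- 1:10
y <- c(1:9, 1000)

pearson  <- stats::cor(x, y, method = "pearson")   # uses the raw values
spearman <- stats::cor(x, y, method = "spearman")  # uses ranks only

print(c(pearson = pearson, spearman = spearman))
```

Because Spearman works on ranks, the size of the extreme value does not matter, only its position in the ordering.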
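A minimal sketch of the IQR-multiplier rule. The helper name `flag_outliers` and the parameter `k` are hypothetical, and the data are made up. In a Shiny app, `k` is the value a slider widget could expose to the reader.

```r
# Hypothetical helper: flag values beyond k * IQR outside the quartiles.
flag_outliers <- function(x, k = 1.5) {
  q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Made-up data with one extreme value.
x <- c(1:10, 100)

print(which(flag_outliers(x, k = 1.5)))  # positions flagged with the usual multiplier
print(which(flag_outliers(x, k = 20)))   # a much larger multiplier flags less
```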
Techniques for data outliers
Scatter plot
Rising-order chart: leave the scale and center options to the reader, so they can see the real data if they want to.
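A sketch of a rising-order chart with optional centering and scaling. The helper name `rising_order` is hypothetical, and the built-in `airquality$Temp` column stands in for real data.

```r
# Hypothetical helper: sort values in rising order, optionally standardized.
rising_order <- function(x, center = TRUE, scale = TRUE) {
  x <- x[!is.na(x)]
  # base::scale() subtracts the mean and divides by the standard deviation;
  # with center = FALSE it divides by the root mean square instead.
  z <- as.numeric(base::scale(x, center = center, scale = scale))
  sort(z)
}

temp <- datasets::airquality$Temp

# Standardized view: comparable across variables.
plot(rising_order(temp), ylab = "Temp (standardized)", main = "Rising-order chart")
# Raw view: what the reader sees when both options are switched off.
plot(rising_order(temp, center = FALSE, scale = FALSE), ylab = "Temp (raw)")
```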
data can be explained
Problems with nominals
Class imbalance
Why do visualisation?
The human brain is a powerful pattern-recognition machine: for example, we can identify a linear trend as soon as we see a scatter plot.
reactive in Shiny
any dataset can be melted
`require(reshape2, quietly = TRUE)`
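A short sketch of melting a wide data set into long form with `reshape2::melt()`. It assumes reshape2 is installed and uses the built-in `airquality` data set as example data; the choice of `Month` and `Day` as identifier columns is an assumption made for the example.

```r
library(reshape2)

df <- datasets::airquality

# Keep Month and Day as identifiers; every other column becomes
# a (variable, value) pair, one row per measurement.
long <- reshape2::melt(df, id.vars = c("Month", "Day"))

print(head(long))
print(dim(long))
```

The long form is convenient when one plotting call should handle every variable.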
Shiny is hard to debug, so work in small steps: make a small change, run, and test.
Set a breakpoint or call `browser()`; in the current version of RStudio they are implemented the same way, so we can simply use a breakpoint to debug.
reactlog can offer a visualised graph of the reactive dependencies.
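A minimal sketch of a reactive expression with the debugging hooks mentioned above. It assumes the shiny package is installed; the input and output ids (`n`, `preview`) are made up, and viewing the logged graph additionally needs the reactlog package.

```r
library(shiny)

# Record reactive events; in a running app the graph can then be opened
# with Ctrl+F3 (Cmd+F3 on macOS).
options(shiny.reactlog = TRUE)

ui <- fluidPage(
  sliderInput("n", "Rows to show", min = 1, max = 20, value = 5),
  tableOutput("preview")
)

server <- function(input, output, session) {
  # Reactive expression: re-evaluated only when input$n changes.
  shown <- reactive({
    # browser()  # uncomment to pause here and inspect input$n
    utils::head(datasets::airquality, input$n)
  })
  output$preview <- renderTable(shown())
}

app <- shinyApp(ui, server)

# Only launch when run interactively, e.g. from RStudio.
if (interactive()) runApp(app)
```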
Feature selection/engineering (dimensionality)