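Until now I have scraped through a few assignments in R by "search-engine-driven programming". I had already put together some notes on R syntax and commonly used packages, but I still felt lost whenever I met a new data set. The assignments for the new semester set a higher bar, so I decided to write down the processing steps in the order I actually perform them and build up some basic routines for data handling.

As the course went on, I found that many sections rest on concepts, coping strategies, sample-handling methods and statistical methods that need to be pinned down. Rather than summarise them hastily here, I would rather leave a link for further reading that I can come back to later.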

The questions being asked about the data are the trail to follow in your analysis.
Check the data in the background before moving on to more visualization.
Always show your raw data to your reader.

Use `PackageName::function_name` to avoid ambiguity and to show exactly which function is being called.
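A minimal sketch of why the explicit prefix helps, using only base R. The local `median` function below is a deliberately contrived, hypothetical example of a name clash, not something from the course material.

```r
values <- c(1, 5, 3, 8)

# Hypothetical user-defined function that masks stats::median() by name
median <- function(x) "masked"

masked_result <- median(values)           # resolves to the local function
explicit_result <- stats::median(values)  # always resolves to the stats package

print(masked_result)
print(explicit_result)
```

The same idea applies to real clashes between packages that export functions with the same name.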

When the data set is large, you cannot explore every row at a glance, so you need the following methods to guide the task.

Import Data

Load the CSV file using something like `read.csv()` or `readr::read_csv()`.

Base R versus readr (part of the tidyverse)

Commonly used parameters
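A small sketch of both import routes, assuming a throwaway CSV file written to a temporary path (the file contents and column names are made up for illustration). The readr call only runs if that package is installed.

```r
# Write a tiny example file so the sketch is self-contained (made-up data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ann,90.5", "2,Bob,NA", "3,Cy,"), csv_path)

# Base R: header, stringsAsFactors and na.strings are the parameters used most often
base_df <- utils::read.csv(
  csv_path,
  header = TRUE,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA")
)
str(base_df)

# readr: returns a tibble; `na` plays the role of na.strings
if (requireNamespace("readr", quietly = TRUE)) {
  readr_tb <- readr::read_csv(csv_path, na = c("", "NA"), col_types = readr::cols())
  print(readr_tb)
}
```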

Understand Data structure

summary or a glimpse

Provide some summary statistics about the data set with `summary()` or `tibble::glimpse()`.

![image-20200226091122857](/Users/neil/Library/Application Support/typora-user-images/image-20200226091122857.png)

data.frame vs tibble

Print the object directly, e.g. `my_tb`

str()

summary()

glimpse()

skimr::skim()
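The functions listed above can be tried side by side. This sketch uses the built-in `iris` data set as a stand-in for the assignment data; the names `my_df` and `my_tb` are only illustrative, and the tibble and skimr parts run only if those packages are installed.

```r
my_df <- datasets::iris

# Base R overviews
str(my_df)      # compact structure: type and first values of each column
summary(my_df)  # per-column summary statistics

# Class of each column, handy when deciding on data roles
column_classes <- vapply(my_df, function(column) class(column)[1], character(1))
print(column_classes)

# A tibble prints only the first rows and shows the column types
if (requireNamespace("tibble", quietly = TRUE)) {
  my_tb <- tibble::as_tibble(my_df)
  print(my_tb)
  tibble::glimpse(my_tb)
}

# skimr adds missing-value counts and small inline histograms
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(my_df))
}
```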

Use `hist()` for numeric data: does anything look odd?
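As a rough sketch, one histogram per numeric column can be drawn in a loop. The built-in `iris` data set stands in for the real data here.

```r
# Keep only the numeric columns of the example data
numeric_data <- Filter(is.numeric, datasets::iris)

# Draw the histograms in a 2 x 2 grid, then restore the plotting settings
old_par <- par(mfrow = c(2, 2))
for (column_name in names(numeric_data)) {
  hist(numeric_data[[column_name]], main = column_name, xlab = column_name)
}
par(old_par)
```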

For association/correlation, use a pairs plot, discussed further below.

We can also build our own visualization.

Data roles

What does each column mean?

The problem of data

Missing Data

Are there missing values?
Is there a pattern to the missing values?
- "Mostly missing" might mean 50%, 60% or 80%: there is no standard threshold, it depends on the question being asked, and you cannot simply throw such data away.
Are there variables that are almost totally missing?
Are there observations that are almost totally missing?

`vis_miss()`: you can easily cluster by observation, and don't forget to sort the variables.

vis_dat

colMeans(is.na(d))

Cluster the missingness to give a better understanding of its pattern.
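The checks above could be sketched as follows, using the built-in `airquality` data set as a stand-in for `d`. The 50% cut-off is an arbitrary illustration, since the notes stress there is no standard threshold. The visdat plots run only if that package is installed.

```r
d <- datasets::airquality

# Proportion of missing values per variable and per observation
col_missing <- colMeans(is.na(d))
row_missing <- rowMeans(is.na(d))
print(sort(col_missing, decreasing = TRUE))

# Variables and observations that are "mostly missing" (illustrative 50% cut-off)
mostly_missing_vars <- names(col_missing)[col_missing > 0.5]
mostly_missing_obs <- which(row_missing > 0.5)

if (requireNamespace("visdat", quietly = TRUE)) {
  print(visdat::vis_dat(d))
  # cluster = TRUE groups observations with similar missingness,
  # sort_miss = TRUE orders the variables by how much is missing
  print(visdat::vis_miss(d, cluster = TRUE, sort_miss = TRUE))
}
```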

Novelty variables

Correlation

For categorical/nominal data

mosaic

Show categorical data; `table()` shows the counts of each value.
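A short sketch of counts and a mosaic plot with base R. The built-in `mtcars` data set is used as a stand-in, with `cyl` and `gear` treated as nominal variables.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$gear <- factor(cars$gear)

# One-way counts of a single nominal variable
cyl_counts <- table(cars$cyl)
print(cyl_counts)

# Two-way counts, which the mosaic plot draws as tile areas
cyl_by_gear <- table(cyl = cars$cyl, gear = cars$gear)
print(cyl_by_gear)
mosaicplot(cyl_by_gear, main = "Cylinders by gears", color = TRUE)
```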

give a chance to your reader to handle the

For numeric data

Pairs Plot

to explore the correlations

A pairs plot can also show categorical data.
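A minimal pairs-plot sketch on the built-in `iris` data set. In the base version the categorical variable is shown through colour; the GGally version (only run if that package is installed) includes the factor column as its own panel.

```r
flowers <- datasets::iris
numeric_columns <- flowers[, 1:4]

# Base R: scatter plot for every pair of numeric variables, coloured by species
pairs(numeric_columns, col = flowers$Species, pch = 19,
      main = "Pairs plot coloured by Species")

# GGally: mixes scatter plots, densities and box plots, including the factor
if (requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggpairs(flowers))
}
```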

Corrgram

![image-20200301201738547](/Users/neil/Library/Application Support/typora-user-images/image-20200301201738547.png)

Reordering the variables only changes the layout; it does not change the correlations.

The method of calculation matters more: Spearman correlation is less affected by outliers than Pearson correlation, so it can produce different groupings.
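To illustrate the difference between the two methods, here is a sketch with made-up numbers in which a single extreme value distorts the Pearson coefficient while the rank-based Spearman coefficient is unchanged. The corrgram call on `mtcars` runs only if that package is installed.

```r
# Made-up data: perfectly monotonic, but the last value is an extreme outlier
x <- 1:10
y <- c(1:9, 100)

pearson_cor <- cor(x, y, method = "pearson")
spearman_cor <- cor(x, y, method = "spearman")
print(c(pearson = pearson_cor, spearman = spearman_cor))

# Full correlation matrices for a data set, one per method
pearson_matrix <- cor(datasets::mtcars, method = "pearson")
spearman_matrix <- cor(datasets::mtcars, method = "spearman")

if (requireNamespace("corrgram", quietly = TRUE)) {
  # order = TRUE only rearranges the variables; the values stay the same
  corrgram::corrgram(datasets::mtcars, order = TRUE)
}
```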

boxplot

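Points beyond the whiskers might be outliers/novelties, but a boxplot does not prove anything; it only suggests that some points could be outliers.

We could standardize the variables so that all of them can be read on one plot.

The IQR multiplier rule assumes roughly normally distributed data, so try to provide a widget that lets your readers adjust the multiplier.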
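A sketch of both ideas with base R: standardising so that all variables share one axis, and varying the IQR multiplier. The helper `find_outliers` is a hypothetical name introduced here; the built-in `iris` and `airquality` data sets are stand-ins.

```r
# Standardise so every variable has mean 0 and sd 1, then plot them together
standardised <- scale(datasets::iris[, 1:4])
boxplot(standardised, range = 1.5, main = "Standardised variables")

# Hypothetical helper: values flagged as possible outliers for a given multiplier
find_outliers <- function(x, iqr_multiplier = 1.5) {
  x <- x[!is.na(x)]
  grDevices::boxplot.stats(x, coef = iqr_multiplier)$out
}

ozone <- datasets::airquality$Ozone
outliers_default <- find_outliers(ozone, iqr_multiplier = 1.5)
outliers_strict <- find_outliers(ozone, iqr_multiplier = 3)
print(outliers_default)
print(outliers_strict)
```

A larger multiplier flags fewer points, which is why it makes sense to let the reader choose it rather than fix it.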

Techniques for data outliers

Scatter plot

plot

Continuity

A rising-order chart plots each variable's values sorted in ascending order. Leave the choice of centring and scaling to the reader, so they can see the real data if they want to.
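One possible reading of the rising-order chart, sketched with base R on the built-in `iris` data set. This is an interpretation of the note, not a definitive recipe: each numeric column is sorted independently, and the `centre_and_scale` flag stands in for the choice that would be left to the reader.

```r
numeric_data <- datasets::iris[, 1:4]

# Sort every column independently so each line rises from left to right
sorted_data <- as.data.frame(lapply(numeric_data, sort))

# Hypothetical switch standing in for a reader-facing control
centre_and_scale <- TRUE
plot_data <- if (centre_and_scale) scale(sorted_data) else as.matrix(sorted_data)

# Jumps in a line hint at gaps in the values (a lack of continuity)
matplot(plot_data, type = "l", lty = 1,
        xlab = "Rank within variable", ylab = "Value",
        main = "Rising-order chart")
legend("topleft", legend = colnames(plot_data), col = 1:4, lty = 1)
```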

data can be explained

Problems with nominals

Class imbalance

Why do visualisation

The human brain is a remarkably powerful pattern-recognition machine: for example, we can spot a linear trend as soon as we see a scatter plot.

Interaction

reactive in Shiny

ggplotly

any dataset can be melted

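# `cm` is assumed to be a confusion matrix object, e.g. from caret::confusionMatrix()
library(reshape2, quietly = TRUE)
library(alluvial, quietly = TRUE)
# Reshape the confusion table into long format: Prediction, Reference, value
melted <- reshape2::melt(cm$table)
# Green where the prediction matches the reference, red otherwise
melted$colour <- ifelse(melted$Prediction == melted$Reference, "green", "red")
par(mar = c(0, 0, 1, 0))  # leave room only for the title at the top
alluvial::alluvial(
  melted[, 1:2],
  freq = melted$value,
  col = melted$colour,
  alpha = 0.5,
  hide = melted$value == 0  # do not draw empty flows
)
mtext("Classification results", font = 2)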

Shiny

Debugging support in Shiny is poor, so work in small steps: make a small change, run it, and test.

Set a breakpoint or call `browser()`. In the current version of RStudio the two behave the same way, so a breakpoint is an easy way to debug.

The reactive log (reactlog) offers a visualised graph of the reactive dependencies.
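A small Shiny sketch that ties these notes together, assuming the shiny package is installed. The input ids, labels and the choice of `iris` are made up for illustration. The app object is built but not launched, and the reactive log additionally needs the reactlog package.

```r
if (requireNamespace("shiny", quietly = TRUE)) {
  # Record reactive activity so it can be inspected as a graph later
  options(shiny.reactlog = TRUE)

  ui <- shiny::fluidPage(
    shiny::checkboxInput("standardise", "Centre and scale", value = TRUE),
    shiny::sliderInput("iqr_multiplier", "IQR multiplier",
                       min = 0.5, max = 5, value = 1.5, step = 0.5),
    shiny::plotOutput("box")
  )

  server <- function(input, output, session) {
    # Reactive expression: re-evaluated only when input$standardise changes
    plot_data <- shiny::reactive({
      # browser()  # uncomment to pause here while the app is running
      numeric_data <- datasets::iris[, 1:4]
      if (input$standardise) scale(numeric_data) else as.matrix(numeric_data)
    })

    output$box <- shiny::renderPlot({
      boxplot(plot_data(), range = input$iqr_multiplier)
    })
  }

  app <- shiny::shinyApp(ui, server)
  # shiny::runApp(app)  # launch interactively, one small change at a time
}
```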

Feature selection/engineering (dimensionality)

Reference

Exploring your First Data Set with R

R语言学习笔记：数据类型与存储

Tags: R

