R 산점도 visualization

2022-03-30 2 분 소요

산점도에 대한 시각화 표현을 해본다.

적용데이타

통계 R에 내장된 데이타 iris 를 사용한다.

3종의 붓꽃종류(Species) 데이터에서 50개 꽃에 대해 꽃받침길이(Sepal.Length), 꽃받침너비(Sepal.Length) , 꽃잎길이(Petal.Length), 꽃잎너비(Petal.Length) 측정치를 센치미터 단위로 제공된 데이터 임

> head(iris, 6)

산점도의 기본형 (plot 사용)

# plot(x=iris$Sepal.Length, y=iris$Petal.Width , ylab = "꽃잎너비", xlab = "꽃받침길이", main = "Sample 산점도") 
# 여기서 x,y의 변수로 정의 할 경우 아래의 data=iris는 사용하지 않는다.
# ylab : y축 제목, xlab : x축 제목, main : 그래프 제목, xlim,ylim : x,y축 범위, col:종류별 색깔  ( 생략가능 옵션 )
# ※ c("red", "blue", "green")[iris$Species] 종류별로 색깔 할당 표현식, 이런 문구는 [] 에 들어가는 변수는 factor 변수에만 적용된다. factor 가 아니면 as.factor()로 변환필요

> plot(formula = Petal.Width ~ Sepal.Length, data = iris, ylab = "꽃잎너비", xlab = "꽃받침길이", 
       main = "Sample 산점도", xlim = c(4, 6) , ylim = c(0, 1) 
       col = c("red", "blue", "green")[iris$Species]
       )

산점도에 smoothing line , Encircling, jittered , geom_count (ggplot 사용)

# ggplot + geom_point + geom_smooth 활용
# geom_point 종류별로 색깔 주려면, geom_point(aes(col=Species)) , point에 size 조정시: geom_point(aes(col=Species, size=0.01)) 
# geom_smooth : smothing 라인 추가, se 옵션 : T/F smothing 라인에 iterval 추가 옵션, 
# geom_encircle : ggalt 라이브러리 안에 있음 , 아래예는 setosa 만 선택하였음
# geom_jitter : 중복된 data는 하나 점으로 찍히는, 중첩된 것을 일부 랜덤평행이동 하여 여러점으로 보여 줌 , width = .5, size=1 평행이동 폭 0.5
# geom_count : 중복되는 점이 많을 수록 점의 크기를 크게 하는 것

> library(ggplot2)
> library(ggalt) 
> encir_dt = iris[ iris$Species=='setosa',]

> ggplot(iris, aes(x=Sepal.Length, y=Petal.Width)) +
  geom_point(aes(col=Species)) +
  geom_smooth(method="auto",se=F) +
  geom_encircle(aes(x=Sepal.Length, y=Petal.Width) , data=encir_dt , color="red" , size=2, expand = 0.09 ) +
  geom_jitter(width = .5, size=1) +   
  geom_count(show.legend=F ) +     
  xlim(c(4,6)) + 
  ylim(c(0,2)) +
  labs( subtitle = "꽃받침길이 Vs 꽃잎너비",  y="꽃잎너비", x="꽃받침길이", title = "Sample 산점도", caption = "Source : 붓꽃데이터")

Bubble Chart

범주별로 상관관계를 파악 하고, Bubble의 Size도 의미를 주는 것임

# ggplot + geom_point + geom_smooth 활용
> library(ggplot2)

> ggplot(iris, aes(x=Sepal.Length, y=Petal.Width)) +
  geom_jitter(aes(col=Species, size=Petal.Width)) +
  geom_smooth(aes(col=Species), method="lm", se=F)

산점도에서 양축에 분포그림 추가

# ggMarginal 활용, type 옵션은 "density", "histogram", "boxplot", "violin", "densigram"

> library(ggplot2)
> library(ggExtra)  # ggMarginal 이 들어있는 라이브러리

> g <- ggplot(iris, aes(x=Sepal.Length, y=Petal.Width)) +
       geom_point(aes(col=Species))
> ggMarginal(g, type = "histogram", fill="transparent")

상관도 Matrix 작성

# 기본형
> pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, main = "붓꽃데이타", 
         pch = 21, bg = c("red", "blue", "green")[iris$Species], data=iris , panel=panel.smooth )

# 기본형 + 위쪽은 상관계수, 대각선 히스토그램 추가

# ------------------------------------------ 함수 Definition -----------------------------
## R helo 참고
## put histograms on the diagonal
> panel.hist <- function(x, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
} 
## with size proportional to the correlations.
> panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
## regression lm
> panel.lm <- function (x, y, col=par("col"), bg=NA, pch=par("pch"),
                     cex=1, col.smooth="black", ...) {
    points(x, y, pch = pch, col = col, bg = bg, cex = cex)
    abline(stats::lm(y ~ x), col = col.smooth, ...)
}
# ------------------------------------------ 함수 Definition End -----------------------------

> pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, main = "붓꽃데이타",
      pch = 21, bg = c("red", "blue", "green")[iris$Species], data=iris , 
      upper.panel = panel.cor, diag.panel = panel.hist , lower.panel = panel.lm )

Twitter Facebook LinkedIn

R 산점도 visualization

적용데이타

산점도의 기본형 (plot 사용)

산점도에 smoothing line , Encircling, jittered , geom_count (ggplot 사용)

Bubble Chart

산점도에서 양축에 분포그림 추가

상관도 Matrix 작성

공유하기

댓글남기기

참고

electron 에서 sqlite3설치후 Cannot find module node_sqlite3.node 오류 발생시 해결 방법

텍스트마이닝 토픽모델, LDA(Latent Dirichlet Allocation)

[d3.js] d3.js 오류

[hive.connect] thrift.transport.TTransport.TTransportException 오류 발생