R apply, lapply, sapply, tapply, by 함수 정리

2022-03-14 7 분 소요

apply 계열 함수는 array, data frame, vector 등 에 대해 함수를 적용하는 경우에 편리하게 사용하는 함수 이다. 통상 각데이터 연산을 할때 for문을 사용하는 것 보다 빠르게 연산을 한다.

함수 비고

종류	사용법
apply	apply(X, MARGIN, FUN, …, simplify = TRUE)
lapply	lapply(X, FUN, …) ,MARGIN 인자는 필요없음 단지 열기준(y축기준) 으로 만 적용됨
sapply	sapply , lapply의 간단형태, simplify=FALSE로 세팅하면 lapply와 유사
tapply	tapply(X, INDEX, FUN = NULL, …, default = NA, simplify = TRUE) , 범주형변수를 다루는데 유용
by	by(data, INDICES, FUN, …, simplify = TRUE) , by 함수와 유사

apply 함수

Input type	output type
Matrix, Array, Data frame	Vector, matrix, array or list

data frame으로 리턴 안됨, list는 input을 사용안됨

# matrix array 에 적용
> x <- cbind(x1 = 3, x2 = c(10:14, 20:24))
> x

##       x1 x2
##  [1,]  3 10
##  [2,]  3 11
##  [3,]  3 12
##  [4,]  3 13
##  [5,]  3 14
##  [6,]  3 20
##  [7,]  3 21
##  [8,]  3 22
##  [9,]  3 23
## [10,]  3 24

# 주로 행기준(x축기준) 연산에 적용
> apply(x,1,function(x){ x })  # 행 기준으로 ft 적용

##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## x1    3    3    3    3    3    3    3    3    3     3
## x2   10   11   12   13   14   20   21   22   23    24

> apply(x,2:1,function(x){ x }) #  apply(x,c(2,1),function(x){ x }) 동일 (원소별 적용)

##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## x1    3    3    3    3    3    3    3    3    3     3
## x2   10   11   12   13   14   20   21   22   23    24

> apply(x,1, function(x){ sum(x) } )  # apply(x,1,sum) 동일표현, x[,1]+x[,2] 와 동일함

##  [1] 13 14 15 16 17 23 24 25 26 27

> apply(x,2:1,sum) # 원소별 합산 그대로 출력

##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## x1    3    3    3    3    3    3    3    3    3     3
## x2   10   11   12   13   14   20   21   22   23    24

# 주로 열기준(y축기준) 연산에 적용
> apply(x,2,function(x){ x })  # 열 기준으로 ft 적용

##       x1 x2
##  [1,]  3 10
##  [2,]  3 11
##  [3,]  3 12
##  [4,]  3 13
##  [5,]  3 14
##  [6,]  3 20
##  [7,]  3 21
##  [8,]  3 22
##  [9,]  3 23
## [10,]  3 24

> apply(x,1:2,function(x){ x }) # apply(x,c(1,2),function(x){ x }) 동일 (원소별 적용)

##       x1 x2
##  [1,]  3 10
##  [2,]  3 11
##  [3,]  3 12
##  [4,]  3 13
##  [5,]  3 14
##  [6,]  3 20
##  [7,]  3 21
##  [8,]  3 22
##  [9,]  3 23
## [10,]  3 24

> apply(x,2, function(x){ sum(x) } )  # 열기준 합산  , c(x1=sum(x[,1]),  x2=sum(x[,2])) 와 동일함

##  x1  x2 
##  30 170

> apply(x,2,function(param){ param + x[,1] })[,2] # 각 컬럼에 첫번째 컬럼을  더해서 2번째 컬럼만 출력

##  [1] 13 14 15 16 17 23 24 25 26 27

> apply(x,1:2,sum) # 원소별 합산 그대로 출력

##       x1 x2
##  [1,]  3 10
##  [2,]  3 11
##  [3,]  3 12
##  [4,]  3 13
##  [5,]  3 14
##  [6,]  3 20
##  [7,]  3 21
##  [8,]  3 22
##  [9,]  3 23
## [10,]  3 24

# 원소별 적용

> apply(x,c(2,1),function(x){ x })

##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## x1    3    3    3    3    3    3    3    3    3     3
## x2   10   11   12   13   14   20   21   22   23    24

lapply 함수 ( apply 와의 차이는 output 형식과 Margin 지정이 없는것)

list 적용 함수로 lapply와 rapply가 있다( 차이는 lapply는 컬럼별 함수 적용, rapply 구성요소 하나하나 함수 적용)

Input type	output type
Vector ,List ,Data frame	list

# list 적용예제
> list1<-list(maths=c(64,45,89,67),english=c(79,84,62,80),physics=c(68,72,69,80),chemistry = c(99,91,84,89))

> lapply(list1,mean) # column 별(즉 키별로 group by 개념과 비슷) 평균 산출

## $maths
## [1] 66.25
## 
## $english
## [1] 76.25
## 
## $physics
## [1] 72.25
## 
## $chemistry
## [1] 90.75

> class(lapply(list1,mean)) # return value type 확인

## [1] "list"

> lapply(list1,function(x) { x + 100 }) # 각 원소에 100  더함

## $maths
## [1] 164 145 189 167
## 
## $english
## [1] 179 184 162 180
## 
## $physics
## [1] 168 172 169 180
## 
## $chemistry
## [1] 199 191 184 189

# 각 원소에 100  더함 , rapply 비교 - 결과는 구성요소 하나하나 나오게됨 
# rapply(list1,function(x) { x + 100 }, how="list")  옵션을 주게 되면 동일함
> rapply(list1,function(x) { x + 100 })

maths1     maths2     maths3     maths4   english1   english2   english3   english4   physics1   physics2   physics3   physics4 chemistry1 chemistry2 
   164        145        189        167        179        184        162        180        168        172        169        180        199        191  chemistry3 chemistry4 
   184        189 

# data frame 적용예제
> ratings <- c(4.2, 4.4, 3.4, 3.9, 5, 4.1, 3.2, 3.9, 4.6, 4.8, 5, 4, 4.5, 3.9, 4.7, 3.6)
> employee.mat <- matrix(ratings,byrow=TRUE,nrow=4,dimnames = list(c("Q1","Q2","Q3","Q4"),c("Kari","Thri","Smith","Jane")))
> employee <- as.data.frame(employee.mat)
> employee

##    Kari Thri Smith Jane
## Q1  4.2  4.4   3.4  3.9
## Q2  5.0  4.1   3.2  3.9
## Q3  4.6  4.8   5.0  4.0
## Q4  4.5  3.9   4.7  3.6

> lapply(employee,cumsum) # column별 누적합계

## $Kari
## [1]  4.2  9.2 13.8 18.3
## 
## $Thri
## [1]  4.4  8.5 13.3 17.2
## 
## $Smith
## [1]  3.4  6.6 11.6 16.3
## 
## $Jane
## [1]  3.9  7.8 11.8 15.4

#사용자 정의로 행 수만큼 loop가 도는지 확인
> count = 0  #전역변수 선언
> printft<-function(x) { count <<- count + 1  }  # 함수안에서 전역변수 변경은 <<- 로 해야함
> lapply(employee,printft)

## $Kari
## [1] 1
## 
## $Thri
## [1] 2
## 
## $Smith
## [1] 3
## 
## $Jane
## [1] 4

> classft<-function(x) { class(x) } #클래스 확인
> lapply(employee,printft) # y축에 해당하는 데이타 타입

## $Kari
## [1] 5
## 
## $Thri
## [1] 6
## 
## $Smith
## [1] 7
## 
## $Jane
## [1] 8

#사용자정의함수적용
> passft<-function(x) {
  return ( if( x[[1]]>4.2) 
            { 'pass'} 
           else 
             { 'failure'} 
          )
}
> lapply(employee,passft)

## $Kari
## [1] "failure"
## 
## $Thri
## [1] "pass"
## 
## $Smith
## [1] "failure"
## 
## $Jane
## [1] "failure"

> passft<-function(x) {
     return ( ifelse( x > 4.2,"Pass","Failure"  )) # column 데이터 배열이 들어오기 때문에 column 비교 함수를 적용해야함
     
 }
> lapply(employee,passft) # 사용자 정의함수 적용

## $Kari
## [1] "Failure" "Pass"    "Pass"    "Pass"   
## 
## $Thri
## [1] "Pass"    "Failure" "Pass"    "Failure"
## 
## $Smith
## [1] "Failure" "Failure" "Pass"    "Pass"   
## 
## $Jane
## [1] "Failure" "Failure" "Failure" "Failure"

> check <- function(x) {
  return (x[x>4.2])
}
> lapply(employee,check) # 사용자 정의함수 적용

## $Kari
## [1] 5.0 4.6 4.5
## 
## $Thri
## [1] 4.4 4.8
## 
## $Smith
## [1] 5.0 4.7
## 
## $Jane
## numeric(0)

sapply 함수

Input type	output type
Vector ,List ,Data frame	Vector, matrix, array or list

> sapply(employee,mean)  # lapply 와의 차이점은 lapply 리턴은 list, sapply의 리턴은 Vector

##  Kari  Thri Smith  Jane 
## 4.575 4.300 4.075 3.850

> sapply(employee,mean,simplify=FALSE )  # lapply 와 동일하게 된다.

## $Kari
## [1] 4.575
## 
## $Thri
## [1] 4.3
## 
## $Smith
## [1] 4.075
## 
## $Jane
## [1] 3.85

> sapply(employee,passft) # 사용자 정의함수 적용

##      Kari      Thri      Smith     Jane     
## [1,] "Failure" "Pass"    "Failure" "Failure"
## [2,] "Pass"    "Failure" "Failure" "Failure"
## [3,] "Pass"    "Pass"    "Pass"    "Failure"
## [4,] "Pass"    "Failure" "Pass"    "Failure"

tapply 함수

tapply(column 1, column 2, FUN) ,column1 : numeric column , column2

factor obejct

> salary <- c(2150000,2970000,3240000,3490000,4520000)
> allowance <- c(200000,300000,300000,300000,400000)
> designation<-c("초급","중급","중급","중급","관리자")
> gender <- c("M","F","F","M","M")

> tapply(salary,designation,mean)

##  관리자    중급    초급 
## 4520000 3233333 2150000

# 사용자 Function 
> userft<-function(x) {
     return ( ifelse( x >= 2500000,"250만 이상","250만 미만"  )) # column 데이터 배열이 들어오기 때문에 column 비교 함수를 적용해야함
}
> tapply(salary,designation,userft)

## $관리자
## [1] "250만 이상"
## 
## $중급
## [1] "250만 이상" "250만 이상" "250만 이상"
## 
## $초급
## [1] "250만 미만"

by 함수

tapply 유사함

> by(salary,designation,mean) # 리턴값이 tapply array 이고 by 는 by class

## designation: 관리자
## [1] 4520000
## ------------------------------------------------------------ 
## designation: 중급
## [1] 3233333
## ------------------------------------------------------------ 
## designation: 초급
## [1] 2150000

> by(salary,list(designation,gender),mean)

## : 관리자
## : F
## [1] NA
## ------------------------------------------------------------ 
## : 중급
## : F
## [1] 3105000
## ------------------------------------------------------------ 
## : 초급
## : F
## [1] NA
## ------------------------------------------------------------ 
## : 관리자
## : M
## [1] 4520000
## ------------------------------------------------------------ 
## : 중급
## : M
## [1] 3490000
## ------------------------------------------------------------ 
## : 초급
## : M
## [1] 2150000

> by(list(salary,allowance) ,designation,colMeans)

## designation: 관리자
## c.2150000..2970000..3240000..3490000..4520000. 
##                                        4520000 
##           c.2e.05..3e.05..3e.05..3e.05..4e.05. 
##                                         400000 
## ------------------------------------------------------------ 
## designation: 중급
## c.2150000..2970000..3240000..3490000..4520000. 
##                                        3233333 
##           c.2e.05..3e.05..3e.05..3e.05..4e.05. 
##                                         300000 
## ------------------------------------------------------------ 
## designation: 초급
## c.2150000..2970000..3240000..3490000..4520000. 
##                                        2150000 
##           c.2e.05..3e.05..3e.05..3e.05..4e.05. 
##                                         200000

> by(salary,designation,userft)

## designation: 관리자
## [1] "250만 이상"
## ------------------------------------------------------------ 
## designation: 중급
## [1] "250만 이상" "250만 이상" "250만 이상"
## ------------------------------------------------------------ 
## designation: 초급
## [1] "250만 미만"

> by(salary,list(designation,gender),userft)

## : 관리자
## : F
## NULL
## ------------------------------------------------------------ 
## : 중급
## : F
## [1] "250만 이상" "250만 이상"
## ------------------------------------------------------------ 
## : 초급
## : F
## NULL
## ------------------------------------------------------------ 
## : 관리자
## : M
## [1] "250만 이상"
## ------------------------------------------------------------ 
## : 중급
## : M
## [1] "250만 이상"
## ------------------------------------------------------------ 
## : 초급
## : M
## [1] "250만 미만"

Twitter Facebook LinkedIn

R apply, lapply, sapply, tapply, by 함수 정리

함수 비고

apply 함수

lapply 함수 ( apply 와의 차이는 output 형식과 Margin 지정이 없는것)

sapply 함수

tapply 함수

by 함수

공유하기

댓글남기기

참고

electron 에서 sqlite3설치후 Cannot find module node_sqlite3.node 오류 발생시 해결 방법

텍스트마이닝 토픽모델, LDA(Latent Dirichlet Allocation)

[d3.js] d3.js 오류

[hive.connect] thrift.transport.TTransport.TTransportException 오류 발생