library(dplyr)
Appendix E — dplyr 速查表
dplyr 函数配合管道操作符使用,并且要求数据满足整洁格式。
在整洁数据中:
- 每个变量独占一列
- 每个观测值或样本独占一行
- 管道操作符
x |> f(y)
等价于f(x,y)
E.1 汇总观测值
对列应用汇总函数生成汇总统计表。汇总函数接收向量输入并返回单个值输出(参见汇总函数章节)。
summarize(.data, ...)
:创建汇总统计表::: {.cell}
|> summarize(avg = mean(mpg)) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 1 × 1 avg <dbl> 1 20.1
::: :::
count(.data, ..., wt = NULL, sort = FLASE, name = NULL)
:按...
变量分组的行数统计。类似函数还有tally()
,add_count()
,add_tally()
。::: {.cell}
|> count(cyl) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 3 × 2 cyl n <dbl> <int> 1 4 11 2 6 7 3 8 14
::: :::
E.2 分组观测
group_by(.data, ..., .add = FALSE, .drop = TRUE)
创建分组副本,dplyr 函数会对每个分组单独操作后合并结果::: {.cell}
|> mtcars group_by(cyl) |> summarize(avg = mean(mpg))
::: {.cell-output .cell-output-stdout}
# A tibble: 3 × 2 cyl avg <dbl> <dbl> 1 4 26.7 2 6 19.7 3 8 15.1
::: :::
rowwise(.data, ...)
将数据转换为逐行分组模式,便于按行运算(适合处理列表类列数据),详见 tidyr 速查表中的列表列工作流::: {.cell}
|> starwars rowwise() |> mutate(film_count = length(films))
::: {.cell-output .cell-output-stdout}
# A tibble: 87 × 15 # Rowwise: name height mass hair_color skin_color eye_color birth_year sex gender <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> 1 Luke Sk… 172 77 blond fair blue 19 male mascu… 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… 4 Darth V… 202 136 none white yellow 41.9 male mascu… 5 Leia Or… 150 49 brown light brown 19 fema… femin… 6 Owen La… 178 120 brown, gr… light blue 52 male mascu… 7 Beru Wh… 165 75 brown light blue 47 fema… femin… 8 R5-D4 97 32 <NA> white, red red NA none mascu… 9 Biggs D… 183 84 black light brown 24 male mascu… 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu… # ℹ 77 more rows # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>, # vehicles <list>, starships <list>, film_count <int>
::: :::
ungroup(x, ...)
返回非分组数据副本
E.3 操作观测
E.3.1 提取观测
行操作函数返回满足条件的行子集表
filter(.data, ..., .preserve = FALSE)
筛选符合逻辑条件的行::: {.cell}
|> filter(mpg > 20) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 14 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 6 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 7 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 8 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 9 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 10 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 11 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 12 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 13 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 14 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
::: :::
distinct(.data, ..., .keep_all = FALSE)
删除重复行::: {.cell}
|> distinct(gear) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 3 × 1 gear <dbl> 1 4 2 3 3 5
::: :::
slice(.data, ...,, .preserve = FALSE)
按位置选择行::: {.cell}
|> slice(10:15) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 6 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 2 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4 3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
::: :::
slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)
随机抽样行,支持数量抽样(n)和比例抽样(prop)::: {.cell}
|> slice_sample(n = 5, replace = TRUE) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 5 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 2 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2 3 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 4 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 5 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
::: :::
slice_min(.data, order_by, ..., n, prop, with_ties = TRUE)
与slice_max()
选择极值行::: {.cell}
|> slice_min(mpg, prop = 0.25) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 8 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 2 10.4 8 460 215 3 5.42 17.8 0 0 3 4 3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 6 15 8 301 335 3.54 3.57 14.6 0 1 5 8 7 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 8 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
::: :::
slice_head(.data, ..., n, prop)
与slice_tail()
选择首尾行::: {.cell}
|> slice_head(n = 5) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 5 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
::: :::
E.3.1.1 适用于 filter()
的逻辑运算符
==
<
<=
is.na()
%in%
|
xor()
!=
>
>=
!is.na()
!
&
- 其他运算符参见
?base::Logic
和?Comparison
E.3.2 排序观测
arrange(.data, ..., .by_group = FALSE)
按列值升序排列,配合desc()
实现降序::: {.cell}
|> arrange(mpg) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 32 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 2 10.4 8 460 215 3 5.42 17.8 0 0 3 4 3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 6 15 8 301 335 3.54 3.57 14.6 0 1 5 8 7 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 8 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 10 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 # ℹ 22 more rows
:::
|> arrange(desc(mpg)) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 32 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 7 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 8 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 10 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 # ℹ 22 more rows
::: :::
E.3.3 添加观测
add_row(.data, ..., .before = NULL, .after = NULL)
添加一行或多行数据::: {.cell}
|> add_row(speed = 1, dist = 1) cars
::: {.cell-output .cell-output-stdout}
speed dist 1 4 2 2 4 10 3 7 4 4 7 22 5 8 16 6 9 10 7 10 18 8 10 26 9 10 34 10 11 17 11 11 28 12 12 14 13 12 20 14 12 24 15 12 28 16 13 26 17 13 34 18 13 34 19 13 46 20 14 26 21 14 36 22 14 60 23 14 80 24 15 20 25 15 26 26 15 54 27 16 32 28 16 40 29 17 32 30 17 40 31 17 50 32 18 42 33 18 56 34 18 76 35 18 84 36 19 36 37 19 46 38 19 68 39 20 32 40 20 48 41 20 52 42 20 56 43 20 64 44 22 66 45 23 54 46 24 70 47 24 92 48 24 93 49 24 120 50 25 85 51 1 1
::: :::
E.4 操作变量
E.4.1 提取变量
列操作函数返回列的子集(向量或表格)
pull(.data, var = -1, name = NULL, ...)
按名称或位置提取列向量::: {.cell}
|> pull(wt) mtcars
::: {.cell-output .cell-output-stdout}
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070 [13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840 [25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
::: :::
select(.data, ...)
提取列生成新表::: {.cell}
|> select(mpg, wt) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 32 × 2 mpg wt <dbl> <dbl> 1 21 2.62 2 21 2.88 3 22.8 2.32 4 21.4 3.22 5 18.7 3.44 6 18.1 3.46 7 14.3 3.57 8 24.4 3.19 9 22.8 3.15 10 19.2 3.44 # ℹ 22 more rows
::: :::
relocate(.data, ..., .before = NULL, .after = NULL)
调整列位置::: {.cell}
|> relocate(mpg, cyl, after = last_col()) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 32 × 11 mpg cyl after disp hp drat wt qsec vs am gear <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 4 160 110 3.9 2.62 16.5 0 1 4 2 21 6 4 160 110 3.9 2.88 17.0 0 1 4 3 22.8 4 1 108 93 3.85 2.32 18.6 1 1 4 4 21.4 6 1 258 110 3.08 3.22 19.4 1 0 3 5 18.7 8 2 360 175 3.15 3.44 17.0 0 0 3 6 18.1 6 1 225 105 2.76 3.46 20.2 1 0 3 7 14.3 8 4 360 245 3.21 3.57 15.8 0 0 3 8 24.4 4 2 147. 62 3.69 3.19 20 1 0 4 9 22.8 4 2 141. 95 3.92 3.15 22.9 1 0 4 10 19.2 6 4 168. 123 3.92 3.44 18.3 1 0 4 # ℹ 22 more rows
::: :::
E.4.1.1 select()
和 across()
辅助函数
|> select(mpg:cyl) mtcars
# A tibble: 32 × 2
mpg cyl
<dbl> <dbl>
1 21 6
2 21 6
3 22.8 4
4 21.4 6
5 18.7 8
6 18.1 6
7 14.3 8
8 24.4 4
9 22.8 4
10 19.2 6
# ℹ 22 more rows
contains(match)
num_range(prefix, range)
:
,例如mpg:cyl
ends_with(match)
all_of(x)
或any_of(x, ..., vars)
!
,例如!gear
starts_with(match)
matches(match)
everything()
E.4.2 批量操作多列
<- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5)) df
across(.cols, .fun, ..., .name = NULL)
对多列进行统一汇总或转换操作::: {.cell}
|> summarize(across(everything(), mean)) df
::: {.cell-output .cell-output-stdout}
# A tibble: 1 × 3 x_1 x_2 y <dbl> <dbl> <dbl> 1 1.5 3.5 4.5
::: :::
c_across(.cols)
在行式数据中跨列运算::: {.cell}
|> df rowwise() |> mutate(x_total = sum(c_across(1:2)))
::: {.cell-output .cell-output-stdout}
# A tibble: 2 × 4 # Rowwise: x_1 x_2 y x_total <dbl> <dbl> <dbl> <dbl> 1 1 3 4 4 2 2 4 5 6
::: :::
E.4.3 创建新变量
对列应用向量化函数(接收向量输入并返回等长向量,参见向量化函数章节)生成新列
mutate(.data, ..., .keep = "all", .before = NULL, .after = NULL)
创建新列,类似函数add_column()
::: {.cell}
|> mutate(gpm = 1 / mpg) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 32 × 12 mpg cyl disp hp drat wt qsec vs am gear carb gpm <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 0.0476 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 0.0476 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0.0439 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 0.0467 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 0.0535 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 0.0552 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 0.0699 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 0.0410 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 0.0439 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 0.0521 # ℹ 22 more rows
:::
|> mutate(mtcars, gpm = 1 / mpg, .keep = "none") mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 32 × 12 mpg cyl disp hp drat wt qsec vs am gear carb gpm <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 0.0476 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 0.0476 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0.0439 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 0.0467 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 0.0535 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 0.0552 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 0.0699 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 0.0410 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 0.0439 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 0.0521 # ℹ 22 more rows
::: :::
rename(.data, ...)
重命名列,rename_with()
可通过函数批量重命名::: {.cell}
|> rename(miles_per_gallon = mpg) mtcars
::: {.cell-output .cell-output-stdout}
# A tibble: 32 × 11 miles_per_gallon cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 # ℹ 22 more rows
::: :::
E.5 向量化函数
E.5.1 适用于 mutate()
E.5.2 位移操作
dplyr::lag()
向前位移dplyr::lead()
向后位移
E.5.3 累积计算
dplyr::cumall()
累积逻辑与dply::cumany()
累积逻辑或cummax()
累积最大值dplyr::cummean()
累积均值cummin()
累积最小值cumprod()
累积乘积cumsum()
累积求和
E.5.4 排序计算
dplyr::cume_dist()
累积分布比例dplyr::dense_rank()
紧密排名(无间隔)dplyr::min_rank()
最小排名dplyr::ntile()
分位数分组dplyr::percent_rank()
标准化排名dplyr::row_number()
行编号(首位优先)
E.5.5 数学运算
+
,-
,/
,^
,%/%
,%%
算术运算符log()
,log2()
,log10()
对数函数<
,<=
,>
,>=
,!=
,==
比较运算符dplyr::between()
区间判断dplyr::near()
近似相等判断
E.5.6 扩展功能
dplyr::case_when()
多条件判断::: {.cell}
|> starwars mutate(type = case_when( > 200 | mass > 200 ~ "large", height == "Droid" ~ "robot", species TRUE ~ "other" ))
::: {.cell-output .cell-output-stdout}
# A tibble: 87 × 15 name height mass hair_color skin_color eye_color birth_year sex gender <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> 1 Luke Sk… 172 77 blond fair blue 19 male mascu… 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… 4 Darth V… 202 136 none white yellow 41.9 male mascu… 5 Leia Or… 150 49 brown light brown 19 fema… femin… 6 Owen La… 178 120 brown, gr… light blue 52 male mascu… 7 Beru Wh… 165 75 brown light blue 47 fema… femin… 8 R5-D4 97 32 <NA> white, red red NA none mascu… 9 Biggs D… 183 84 black light brown 24 male mascu… 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu… # ℹ 77 more rows # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>, # vehicles <list>, starships <list>, type <chr>
::: :::
dplyr::coalesce()
首有效值选取dplyr::if_else()
元素级条件判断dplyr::na_if()
特定值替换为 NApmax()
元素级最大值pmin()
元素级最小值
E.6 汇总函数
E.6.1 适用于 summarize()
E.6.2 计数类
dplyr::n()
计数dplyr::n_distinct()
唯一值计数sum(!is.na())
非空值计数
E.6.3 位置度量
mean()
均值median()
中位数
E.6.4 逻辑运算
mean()
真值比例sum()
真值数量
E.6.5 顺序提取
dplyr::first()
首元素dplyr::last()
末元素dplyr::nth()
指定位置元素
E.6.6 极值提取
quantile()
分位数min()
最小值max()
最大值
E.6.7 离散度量
IQR()
四分位距mad()
绝对中位差sd()
标准差var()
方差
E.7 行名处理
整洁数据不使用行名(存储于列外部的变量)。需要行名操作时:
tibble::rownames_to_column()
行名转列::: {.cell}
<- rownames_to_column(mtcars, var = "C") a
:::
tibble::columns_to_rownames()
列转行名::: {.cell}
column_to_rownames(a, var = "C")
::: {.cell-output .cell-output-stdout}
mpg cyl disp hp drat wt qsec vs am gear carb 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
::: :::
配套函数
tibble::has_rownames()
和tibble::remove_rownames()
E.8 表格合并
<- tribble(
x ~A, ~B, ~C,
"a", "t", 1,
"b", "u", 2,
"c", "v", 3
)
<- tribble(
y ~A, ~B, ~D,
"a", "t", 3,
"b", "u", 2,
"d", "w", 1
)
E.8.1 横向合并
bind_cols(..., .name_repair)
横向拼接表格(不自动匹配列名)
E.8.2 纵向叠加
bind_rows(..., .id = NULL)
纵向合并,.id
参数可添加来源标识列
E.8.3 关系连接
扩展连接通过匹配键合并表格列:
left_join(x, y)
左连接right_join(x, y)
右连接inner_join(x, y)
内连接full_join(x, y)
全连接
过滤连接按另一个表筛选当前表:
semi_join(x, y)
交集筛选anti_join(x, y)
差集筛选
嵌套连接:
nest_join(x, y)
内连接结果嵌套存储
E.8.4 连接键设置
多列匹配:
by = join_by(col1, col2)
::: {.cell}
left_join(x, y, by = join_by(A))
::: {.cell-output .cell-output-stdout}
# A tibble: 3 × 5 A B.x C B.y D <chr> <chr> <dbl> <chr> <dbl> 1 a t 1 t 3 2 b u 2 u 2 3 c v 3 <NA> NA
:::
left_join(x, y, by = join_by(A, B))
::: {.cell-output .cell-output-stdout}
# A tibble: 3 × 4 A B C D <chr> <chr> <dbl> <dbl> 1 a t 1 3 2 b u 2 2 3 c v 3 NA
::: :::
跨名匹配:
by = join_by(col1 == col2)
::: {.cell}
left_join(x, y, by = join_by(C == D))
::: {.cell-output .cell-output-stdout}
# A tibble: 3 × 5 A.x B.x C A.y B.y <chr> <chr> <dbl> <chr> <chr> 1 a t 1 d w 2 b u 2 b u 3 c v 3 a t
::: :::
同名后缀参数:
suffix
::: {.cell}
left_join(x, y, by = join_by(C == D), suffix = c("1", "2"))
::: {.cell-output .cell-output-stdout}
# A tibble: 3 × 5 A1 B1 C A2 B2 <chr> <chr> <dbl> <chr> <chr> 1 a t 1 d w 2 b u 2 b u 3 c v 3 a t
::: :::
E.8.5 集合运算
intersect(x, y)
交集setdiff(x, y)
差集union(x, y)
并集(去重)setequal()
集合相等判断
# 设置随机种子
set.seed(123)
# 生成两个随机字母向量
= sample(letters, 5)
x = sample(letters, 8)
y print(x)
[1] "o" "s" "n" "c" "j"
print(y)
[1] "r" "v" "k" "e" "t" "n" "w" "s"
# 交集
intersect(x, y)
[1] "s" "n"
# 差集
setdiff(x, y)
[1] "o" "c" "j"
# 并集
union(x, y)
[1] "o" "s" "n" "c" "j" "r" "v" "k" "e" "t" "w"
# 集合相等
setequal(x, y)
[1] FALSE