Appendix E — dplyr 速查表

dplyr 函数配合管道操作符使用,并且要求数据满足整洁格式。

在整洁数据中:

library(dplyr)

E.1 汇总观测值

对列应用汇总函数生成汇总统计表。汇总函数接收向量输入并返回单个值输出(参见汇总函数章节)。

  • summarize(.data, ...):创建汇总统计表

    ::: {.cell}

    mtcars |> summarize(avg = mean(mpg))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 1 × 1
        avg
      <dbl>
    1  20.1

    ::: :::

  • count(.data, ..., wt = NULL, sort = FLASE, name = NULL):按 ... 变量分组的行数统计。类似函数还有 tally(), add_count(), add_tally()

    ::: {.cell}

    mtcars |> count(cyl)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 3 × 2
        cyl     n
      <dbl> <int>
    1     4    11
    2     6     7
    3     8    14

    ::: :::

E.2 分组观测

  • group_by(.data, ..., .add = FALSE, .drop = TRUE) 创建分组副本,dplyr 函数会对每个分组单独操作后合并结果

    ::: {.cell}

    mtcars |>
      group_by(cyl) |>
      summarize(avg = mean(mpg))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 3 × 2
        cyl   avg
      <dbl> <dbl>
    1     4  26.7
    2     6  19.7
    3     8  15.1

    ::: :::

  • rowwise(.data, ...) 将数据转换为逐行分组模式,便于按行运算(适合处理列表类列数据),详见 tidyr 速查表中的列表列工作流

    ::: {.cell}

    starwars |>
      rowwise() |>
      mutate(film_count = length(films))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 87 × 15
    # Rowwise: 
       name     height  mass hair_color skin_color eye_color birth_year sex   gender
       <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
     1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
     2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
     3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
     4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
     5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
     6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
     7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
     8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
     9 Biggs D…    183    84 black      light      brown           24   male  mascu…
    10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
    # ℹ 77 more rows
    # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
    #   vehicles <list>, starships <list>, film_count <int>

    ::: :::

  • ungroup(x, ...) 返回非分组数据副本

E.3 操作观测

E.3.1 提取观测

行操作函数返回满足条件的行子集表

  • filter(.data, ..., .preserve = FALSE) 筛选符合逻辑条件的行

    ::: {.cell}

    mtcars |> filter(mpg > 20)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 14 × 11
         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1  21       6 160     110  3.9   2.62  16.5     0     1     4     4
     2  21       6 160     110  3.9   2.88  17.0     0     1     4     4
     3  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
     4  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
     5  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
     6  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
     7  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
     8  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
     9  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
    10  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
    11  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
    12  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
    13  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
    14  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

    ::: :::

  • distinct(.data, ..., .keep_all = FALSE) 删除重复行

    ::: {.cell}

    mtcars |> distinct(gear)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 3 × 1
       gear
      <dbl>
    1     4
    2     3
    3     5

    ::: :::

  • slice(.data, ...,, .preserve = FALSE) 按位置选择行

    ::: {.cell}

    mtcars |> slice(10:15)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 6 × 11
        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
    2  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
    3  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
    4  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
    5  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
    6  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4

    ::: :::

  • slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE) 随机抽样行,支持数量抽样(n)和比例抽样(prop)

    ::: {.cell}

    mtcars |> slice_sample(n = 5, replace = TRUE)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 5 × 11
        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1  26       4  120.    91  4.43  2.14  16.7     0     1     5     2
    2  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2
    3  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
    4  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
    5  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4

    ::: :::

  • slice_min(.data, order_by, ..., n, prop, with_ties = TRUE)slice_max() 选择极值行

    ::: {.cell}

    mtcars |> slice_min(mpg, prop = 0.25)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 8 × 11
        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
    2  10.4     8  460    215  3     5.42  17.8     0     0     3     4
    3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
    4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
    5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
    6  15       8  301    335  3.54  3.57  14.6     0     1     5     8
    7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
    8  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2

    ::: :::

  • slice_head(.data, ..., n, prop)slice_tail() 选择首尾行

    ::: {.cell}

    mtcars |> slice_head(n = 5)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 5 × 11
        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
    2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
    3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
    4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
    5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2

    ::: :::

E.3.1.1 适用于 filter() 的逻辑运算符

  • ==
  • <
  • <=
  • is.na()
  • %in%
  • |
  • xor()
  • !=
  • >
  • >=
  • !is.na()
  • !
  • &
  • 其他运算符参见 ?base::Logic?Comparison

E.3.2 排序观测

  • arrange(.data, ..., .by_group = FALSE) 按列值升序排列,配合 desc() 实现降序

    ::: {.cell}

    mtcars |> arrange(mpg)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 32 × 11
         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
     2  10.4     8  460    215  3     5.42  17.8     0     0     3     4
     3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
     4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
     5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
     6  15       8  301    335  3.54  3.57  14.6     0     1     5     8
     7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
     8  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
     9  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
    10  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
    # ℹ 22 more rows

    :::

    mtcars |> arrange(desc(mpg))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 32 × 11
         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
     2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
     3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
     4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
     5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
     6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
     7  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
     8  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
     9  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
    10  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
    # ℹ 22 more rows

    ::: :::

E.3.3 添加观测

  • add_row(.data, ..., .before = NULL, .after = NULL) 添加一行或多行数据

    ::: {.cell}

    cars |> add_row(speed = 1, dist = 1)

    ::: {.cell-output .cell-output-stdout}

       speed dist
    1      4    2
    2      4   10
    3      7    4
    4      7   22
    5      8   16
    6      9   10
    7     10   18
    8     10   26
    9     10   34
    10    11   17
    11    11   28
    12    12   14
    13    12   20
    14    12   24
    15    12   28
    16    13   26
    17    13   34
    18    13   34
    19    13   46
    20    14   26
    21    14   36
    22    14   60
    23    14   80
    24    15   20
    25    15   26
    26    15   54
    27    16   32
    28    16   40
    29    17   32
    30    17   40
    31    17   50
    32    18   42
    33    18   56
    34    18   76
    35    18   84
    36    19   36
    37    19   46
    38    19   68
    39    20   32
    40    20   48
    41    20   52
    42    20   56
    43    20   64
    44    22   66
    45    23   54
    46    24   70
    47    24   92
    48    24   93
    49    24  120
    50    25   85
    51     1    1

    ::: :::

E.4 操作变量

E.4.1 提取变量

列操作函数返回列的子集(向量或表格)

  • pull(.data, var = -1, name = NULL, ...) 按名称或位置提取列向量

    ::: {.cell}

    mtcars |> pull(wt)

    ::: {.cell-output .cell-output-stdout}

     [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
    [13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
    [25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780

    ::: :::

  • select(.data, ...) 提取列生成新表

    ::: {.cell}

    mtcars |> select(mpg, wt)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 32 × 2
         mpg    wt
       <dbl> <dbl>
     1  21    2.62
     2  21    2.88
     3  22.8  2.32
     4  21.4  3.22
     5  18.7  3.44
     6  18.1  3.46
     7  14.3  3.57
     8  24.4  3.19
     9  22.8  3.15
    10  19.2  3.44
    # ℹ 22 more rows

    ::: :::

  • relocate(.data, ..., .before = NULL, .after = NULL) 调整列位置

    ::: {.cell}

    mtcars |> relocate(mpg, cyl, after = last_col())

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 32 × 11
         mpg   cyl after  disp    hp  drat    wt  qsec    vs    am  gear
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1  21       6     4  160    110  3.9   2.62  16.5     0     1     4
     2  21       6     4  160    110  3.9   2.88  17.0     0     1     4
     3  22.8     4     1  108     93  3.85  2.32  18.6     1     1     4
     4  21.4     6     1  258    110  3.08  3.22  19.4     1     0     3
     5  18.7     8     2  360    175  3.15  3.44  17.0     0     0     3
     6  18.1     6     1  225    105  2.76  3.46  20.2     1     0     3
     7  14.3     8     4  360    245  3.21  3.57  15.8     0     0     3
     8  24.4     4     2  147.    62  3.69  3.19  20       1     0     4
     9  22.8     4     2  141.    95  3.92  3.15  22.9     1     0     4
    10  19.2     6     4  168.   123  3.92  3.44  18.3     1     0     4
    # ℹ 22 more rows

    ::: :::

E.4.1.1 select()across() 辅助函数

mtcars |> select(mpg:cyl)
# A tibble: 32 × 2
     mpg   cyl
   <dbl> <dbl>
 1  21       6
 2  21       6
 3  22.8     4
 4  21.4     6
 5  18.7     8
 6  18.1     6
 7  14.3     8
 8  24.4     4
 9  22.8     4
10  19.2     6
# ℹ 22 more rows
  • contains(match)
  • num_range(prefix, range)
  • :,例如 mpg:cyl
  • ends_with(match)
  • all_of(x)any_of(x, ..., vars)
  • !,例如 !gear
  • starts_with(match)
  • matches(match)
  • everything()

E.4.2 批量操作多列

df <- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5))
  • across(.cols, .fun, ..., .name = NULL) 对多列进行统一汇总或转换操作

    ::: {.cell}

    df |> summarize(across(everything(), mean))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 1 × 3
        x_1   x_2     y
      <dbl> <dbl> <dbl>
    1   1.5   3.5   4.5

    ::: :::

  • c_across(.cols) 在行式数据中跨列运算

    ::: {.cell}

    df |> 
      rowwise() |>
      mutate(x_total = sum(c_across(1:2)))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 2 × 4
    # Rowwise: 
        x_1   x_2     y x_total
      <dbl> <dbl> <dbl>   <dbl>
    1     1     3     4       4
    2     2     4     5       6

    ::: :::

E.4.3 创建新变量

对列应用向量化函数(接收向量输入并返回等长向量,参见向量化函数章节)生成新列

  • mutate(.data, ..., .keep = "all", .before = NULL, .after = NULL) 创建新列,类似函数 add_column()

    ::: {.cell}

    mtcars |> mutate(gpm = 1 / mpg)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 32 × 12
         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    gpm
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
     1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 0.0476
     2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 0.0476
     3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 0.0439
     4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 0.0467
     5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 0.0535
     6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 0.0552
     7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 0.0699
     8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2 0.0410
     9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 0.0439
    10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 0.0521
    # ℹ 22 more rows

    :::

    mtcars |> mutate(mtcars, gpm = 1 / mpg, .keep = "none")

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 32 × 12
         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    gpm
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
     1  21       6  160    110  3.9   2.62  16.5     0     1     4     4 0.0476
     2  21       6  160    110  3.9   2.88  17.0     0     1     4     4 0.0476
     3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 0.0439
     4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 0.0467
     5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 0.0535
     6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 0.0552
     7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 0.0699
     8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2 0.0410
     9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 0.0439
    10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 0.0521
    # ℹ 22 more rows

    ::: :::

  • rename(.data, ...) 重命名列,rename_with() 可通过函数批量重命名

    ::: {.cell}

    mtcars |> rename(miles_per_gallon = mpg)

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 32 × 11
       miles_per_gallon   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
                  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1             21       6  160    110  3.9   2.62  16.5     0     1     4     4
     2             21       6  160    110  3.9   2.88  17.0     0     1     4     4
     3             22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
     4             21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
     5             18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
     6             18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
     7             14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
     8             24.4     4  147.    62  3.69  3.19  20       1     0     4     2
     9             22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
    10             19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
    # ℹ 22 more rows

    ::: :::

E.5 向量化函数

E.5.1 适用于 mutate()

E.5.2 位移操作

  • dplyr::lag() 向前位移
  • dplyr::lead() 向后位移

E.5.3 累积计算

  • dplyr::cumall() 累积逻辑与
  • dply::cumany() 累积逻辑或
  • cummax() 累积最大值
  • dplyr::cummean() 累积均值
  • cummin() 累积最小值
  • cumprod() 累积乘积
  • cumsum() 累积求和

E.5.4 排序计算

  • dplyr::cume_dist() 累积分布比例
  • dplyr::dense_rank() 紧密排名(无间隔)
  • dplyr::min_rank() 最小排名
  • dplyr::ntile() 分位数分组
  • dplyr::percent_rank() 标准化排名
  • dplyr::row_number() 行编号(首位优先)

E.5.5 数学运算

  • +, -, /, ^, %/%, %% 算术运算符
  • log(), log2(), log10() 对数函数
  • <, <=, >, >=, !=, == 比较运算符
  • dplyr::between() 区间判断
  • dplyr::near() 近似相等判断

E.5.6 扩展功能

  • dplyr::case_when() 多条件判断

    ::: {.cell}

    starwars |>
      mutate(type = case_when(
        height > 200 | mass > 200 ~ "large",
        species == "Droid" ~ "robot",
        TRUE ~ "other"
      ))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 87 × 15
       name     height  mass hair_color skin_color eye_color birth_year sex   gender
       <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
     1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
     2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
     3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
     4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
     5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
     6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
     7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
     8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
     9 Biggs D…    183    84 black      light      brown           24   male  mascu…
    10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
    # ℹ 77 more rows
    # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
    #   vehicles <list>, starships <list>, type <chr>

    ::: :::

  • dplyr::coalesce() 首有效值选取

  • dplyr::if_else() 元素级条件判断

  • dplyr::na_if() 特定值替换为 NA

  • pmax() 元素级最大值

  • pmin() 元素级最小值

E.6 汇总函数

E.6.1 适用于 summarize()

E.6.2 计数类

  • dplyr::n() 计数
  • dplyr::n_distinct() 唯一值计数
  • sum(!is.na()) 非空值计数

E.6.3 位置度量

  • mean() 均值
  • median() 中位数

E.6.4 逻辑运算

  • mean() 真值比例
  • sum() 真值数量

E.6.5 顺序提取

  • dplyr::first() 首元素
  • dplyr::last() 末元素
  • dplyr::nth() 指定位置元素

E.6.6 极值提取

  • quantile() 分位数
  • min() 最小值
  • max() 最大值

E.6.7 离散度量

  • IQR() 四分位距
  • mad() 绝对中位差
  • sd() 标准差
  • var() 方差

E.7 行名处理

整洁数据不使用行名(存储于列外部的变量)。需要行名操作时:

  • tibble::rownames_to_column() 行名转列

    ::: {.cell}

    a <- rownames_to_column(mtcars, var = "C")

    :::

  • tibble::columns_to_rownames() 列转行名

    ::: {.cell}

    column_to_rownames(a, var = "C")

    ::: {.cell-output .cell-output-stdout}

        mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
    8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
    11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
    12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
    13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
    14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
    15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
    16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
    17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
    18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
    23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
    24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
    25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
    26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
    32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

    ::: :::

  • 配套函数 tibble::has_rownames()tibble::remove_rownames()

E.8 表格合并

x <- tribble(
   ~A,  ~B, ~C,
  "a", "t",  1,
  "b", "u",  2,
  "c", "v",  3
)

y <- tribble(
   ~A,  ~B, ~D,
  "a", "t",  3,
  "b", "u",  2,
  "d", "w",  1
)

E.8.1 横向合并

  • bind_cols(..., .name_repair) 横向拼接表格(不自动匹配列名)

E.8.2 纵向叠加

  • bind_rows(..., .id = NULL) 纵向合并,.id 参数可添加来源标识列

E.8.3 关系连接

扩展连接通过匹配键合并表格列:

  • left_join(x, y) 左连接
  • right_join(x, y) 右连接
  • inner_join(x, y) 内连接
  • full_join(x, y) 全连接

过滤连接按另一个表筛选当前表:

  • semi_join(x, y) 交集筛选
  • anti_join(x, y) 差集筛选

嵌套连接

  • nest_join(x, y) 内连接结果嵌套存储

E.8.4 连接键设置

  • 多列匹配:by = join_by(col1, col2)

    ::: {.cell}

    left_join(x, y, by = join_by(A))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 3 × 5
      A     B.x       C B.y       D
      <chr> <chr> <dbl> <chr> <dbl>
    1 a     t         1 t         3
    2 b     u         2 u         2
    3 c     v         3 <NA>     NA

    :::

    left_join(x, y, by = join_by(A, B))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 3 × 4
      A     B         C     D
      <chr> <chr> <dbl> <dbl>
    1 a     t         1     3
    2 b     u         2     2
    3 c     v         3    NA

    ::: :::

  • 跨名匹配:by = join_by(col1 == col2)

    ::: {.cell}

    left_join(x, y, by = join_by(C == D))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 3 × 5
      A.x   B.x       C A.y   B.y  
      <chr> <chr> <dbl> <chr> <chr>
    1 a     t         1 d     w    
    2 b     u         2 b     u    
    3 c     v         3 a     t    

    ::: :::

  • 同名后缀参数:suffix

    ::: {.cell}

    left_join(x, y, by = join_by(C == D), suffix = c("1", "2"))

    ::: {.cell-output .cell-output-stdout}

    # A tibble: 3 × 5
      A1    B1        C A2    B2   
      <chr> <chr> <dbl> <chr> <chr>
    1 a     t         1 d     w    
    2 b     u         2 b     u    
    3 c     v         3 a     t    

    ::: :::

E.8.5 集合运算

  • intersect(x, y) 交集
  • setdiff(x, y) 差集
  • union(x, y) 并集(去重)
  • setequal() 集合相等判断
# 设置随机种子
set.seed(123)

# 生成两个随机字母向量
x = sample(letters, 5)
y = sample(letters, 8)
print(x)
[1] "o" "s" "n" "c" "j"
print(y)
[1] "r" "v" "k" "e" "t" "n" "w" "s"
# 交集  
intersect(x, y)
[1] "s" "n"
# 差集
setdiff(x, y)
[1] "o" "c" "j"
# 并集
union(x, y)
 [1] "o" "s" "n" "c" "j" "r" "v" "k" "e" "t" "w"
# 集合相等
setequal(x, y)
[1] FALSE