Smart reading of data files: the vroom package for R
2023-07-14 23:42


1. Automatic delimiter detection


library(vroom)
data <- vroom("flights.tsv")
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message


s <- spec(data)
data <- vroom("flights.tsv", col_types = s)
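The following is a minimal sketch of delimiter guessing and of reusing a column specification. It does not use flights.tsv: the `toy` data frame and the temporary file paths are hypothetical stand-ins so the code is self-contained, and `show_col_types = FALSE` assumes a reasonably recent vroom release.

```r
library(vroom)

# Hypothetical toy data, written to temporary files for illustration only.
toy <- data.frame(id = 1:3, name = c("a", "b", "c"))
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
vroom_write(toy, csv_path, delim = ",")
vroom_write(toy, tsv_path)  # tab is the default delimiter

# No delim argument: vroom guesses the delimiter from the file contents.
from_csv <- vroom(csv_path, show_col_types = FALSE)
from_tsv <- vroom(tsv_path, show_col_types = FALSE)

# Reuse the guessed specification so a later read skips guessing
# and prints no column-type message.
s <- spec(from_csv)
again <- vroom(csv_path, delim = ",", col_types = s)
```

Passing `delim` explicitly, as in the last call, is a reasonable safeguard when a file is very short or its fields contain several candidate delimiter characters.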

2. Reading multiple files at once


files <- fs::dir_ls(glob = "flights_*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv
#> flights_YV.tsv

data <- vroom(files)
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
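As a self-contained sketch of the same idea, the code below writes two small hypothetical files into a temporary directory and reads them in one call. The file contents and the column name `source_file` are made up for illustration. The `id` argument of `vroom()` adds a column recording which file each row came from, and base R `list.files()` stands in for `fs::dir_ls()` to avoid an extra dependency.

```r
library(vroom)

# Hypothetical per-carrier files in a temporary directory.
demo_dir <- tempfile("flights_demo_")
dir.create(demo_dir)
vroom_write(data.frame(carrier = "AA", flight = c(1, 2)),
            file.path(demo_dir, "flights_AA.tsv"))
vroom_write(data.frame(carrier = "UA", flight = 3),
            file.path(demo_dir, "flights_UA.tsv"))

files <- list.files(demo_dir, pattern = "^flights_.*\\.tsv$", full.names = TRUE)

# One call reads every file and stacks the rows; `id` names a new column
# holding the path of the file each row was read from.
combined <- vroom(files, delim = "\t", id = "source_file",
                  show_col_types = FALSE)
```

This assumes all files share the same columns, which is the situation the flights_*.tsv example describes.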

3. Reading and writing compressed files

  • vroom_write() can write compressed files directly

vroom_write(flights, "flights.tsv.gz")

# Check file sizes to show file is compressed
fs::file_size(c("flights.tsv", "flights.tsv.gz"))
#> 29.62M 7.87M

# Read the file back in
data <- vroom("flights.tsv.gz")
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
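The round trip can be tried without the flights data. The sketch below uses a hypothetical `toy` data frame and a temporary path; the only thing that triggers compression is the `.gz` file extension. The magic-byte check is an extra sanity test added here, not something vroom requires.

```r
library(vroom)

# Hypothetical toy data; the .gz extension asks vroom_write() to gzip the output.
toy <- data.frame(x = 1:1000, y = rep(c("a", "b"), 500))
gz_path <- tempfile(fileext = ".tsv.gz")
vroom_write(toy, gz_path)

# gzip streams begin with the two bytes 1f 8b.
magic <- readBin(gz_path, what = "raw", n = 2)

# vroom() decompresses transparently when reading the file back.
round_trip <- vroom(gz_path, delim = "\t", show_col_types = FALSE)
```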

4. Reading files from the web

file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
data <- vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

5. Reading from and writing to pipe connections


  • Extract the United Airlines rows (lines that contain UA as a whole word)

# Return only flights on United Airlines
data <- vroom(pipe("grep -w UA flights.tsv"), col_names = names(flights))
#> Observations: 58,665
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
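A small sketch of why `col_names` is needed with this pattern: `grep` drops the header line because it does not contain UA, so the names have to be supplied by hand. The `toy` data and the temporary path are hypothetical, and the example assumes a Unix-like system where `grep` is on the PATH.

```r
library(vroom)

# Hypothetical toy data standing in for flights.tsv.
toy <- data.frame(carrier = c("UA", "AA", "UA"), flight = c(1, 2, 3))
path <- tempfile(fileext = ".tsv")
vroom_write(toy, path)

# grep keeps only lines containing UA as a whole word, so the header row is
# filtered out too; col_names restores the column names.
cmd <- paste("grep -w UA", shQuote(path))
ua <- vroom(pipe(cmd), delim = "\t", col_names = names(toy),
            show_col_types = FALSE)
```

Note that `grep -w UA` matches UA in any field, not only in the carrier column, so a filter on the parsed `carrier` column is the safer choice when other columns could contain the same token.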

  • Alternatively, pipe the output through the pigz compression tool when writing a compressed file

bench::workout({
  vroom_write(flights, "flights.tsv.gz")
  vroom_write(flights, pipe("pigz > flights.tsv.gz"))
})
#> # A tibble: 2 x 3
#>   exprs                                                process     real
#>   <bch:expr>                                          <bch:tm> <bch:tm>
#> 1 vroom_write(flights, "flights.tsv.gz")                  3.5s    2.69s
#> 2 vroom_write(flights, pipe("pigz > flights.tsv.gz"))    1.54s 975.09ms

6. Selecting columns

  • Keep only the specified columns

data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

  • Drop the specified columns

data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

  • Rename a column while selecting

data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

data
#> # A tibble: 336,776 x 19
#>    plane  year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr> <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 N142…  2013     1     1      517            515         2      830
#>  2 N242…  2013     1     1      533            529         4      850
#>  3 N619…  2013     1     1      542            540         2      923
#>  4 N804…  2013     1     1      544            545        -1     1004
#>  5 N668…  2013     1     1      554            600        -6      812
#>  6 N394…  2013     1     1      554            558        -4      740
#>  7 N516…  2013     1     1      555            600        -5      913
#>  8 N829…  2013     1     1      557            600        -3      709
#>  9 N593…  2013     1     1      557            600        -3      838
#> 10 N3AL…  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
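The three selection styles above can be compared side by side on a small file. This is a sketch on hypothetical data: the `toy` data frame only borrows a few column names from the flights example, and the rename case lists columns explicitly instead of using `everything()`.

```r
library(vroom)

# Hypothetical four-column file reusing some flights column names.
toy <- data.frame(
  year     = c(2013, 2013),
  flight   = c(1545, 1714),
  tailnum  = c("N14228", "N24211"),
  dep_time = c(517, 533)
)
path <- tempfile(fileext = ".tsv")
vroom_write(toy, path)

# Keep: only the named columns are returned.
kept <- vroom(path, delim = "\t", col_select = c(year, flight),
              show_col_types = FALSE)

# Drop: a minus sign excludes a column and keeps the rest.
dropped <- vroom(path, delim = "\t", col_select = -dep_time,
                 show_col_types = FALSE)

# Rename: new_name = old_name inside list() renames while selecting.
renamed <- vroom(path, delim = "\t", col_select = list(plane = tailnum, year),
                 show_col_types = FALSE)
```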

7. Setting column types

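In most cases vroom guesses column types correctly, but it occasionally gets one wrong, and then you can specify the type by hand. You can also fix the types afterwards with dplyr, although that takes a little more work.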

Type reference. The character in [ ] is the abbreviation used in a compact column specification.

  • col_logical() [l]: containing only T, F, TRUE, FALSE, 1 or 0.
  • col_integer() [i]: integer values.
  • col_double() [d]: floating point values.
  • col_number() [n]: numbers containing the grouping_mark.
  • col_date(format = "") [D]: with the locale’s date_format.
  • col_time(format = "") [t]: with the locale’s time_format.
  • col_datetime(format = "") [T]: ISO8601 date times.
  • col_factor(levels, ordered) [f]: a fixed set of values.
  • col_character() [c]: everything else.
  • col_skip() [_, -]: don’t import this column.
  • col_guess() [?]: parse using the “best” type based on the input.


# read the 'year' column as an integer
data <- vroom("flights.tsv", col_types = c(year = "i"))

# also skip reading the 'time_hour' column
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_"))

# also read the carrier as a factor
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_", carrier = "f"))

data <- vroom("flights.tsv",
  col_types = list(year = col_integer(), time_hour = col_skip(), carrier = col_factor())
)
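To see what the overrides change, the sketch below reads the same hypothetical file three times: once with guessed types, once with the abbreviation form, and once with the col_*() form. The `toy` data frame and temporary path are illustrative only.

```r
library(vroom)

# Hypothetical three-column file reusing some flights column names.
toy <- data.frame(
  year      = c(2013, 2013),
  carrier   = c("UA", "AA"),
  time_hour = c("2013-01-01 05:00:00", "2013-01-01 06:00:00")
)
path <- tempfile(fileext = ".tsv")
vroom_write(toy, path)

# Guessed types: whole numbers are read as double, not integer.
guessed <- vroom(path, delim = "\t", show_col_types = FALSE)

# Abbreviation form: integer year, factor carrier, time_hour skipped.
typed <- vroom(path, delim = "\t",
               col_types = c(year = "i", carrier = "f", time_hour = "_"))

# Equivalent col_*() form.
typed_long <- vroom(path, delim = "\t",
                    col_types = list(year = col_integer(),
                                     carrier = col_factor(),
                                     time_hour = col_skip()))
```

Setting the type at read time avoids parsing a column once as the guessed type and converting it again later, which is the extra work the dplyr route involves.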

8. Read speed


