Smart reading of data files: the R package vroom
2023-07-14 23:42

While tinkering with Shiny recently I came across a very handy package for reading data. These are my notes for future reference.

1. Automatic delimiter detection

vroom detects the delimiter automatically, so CSV and TSV files alike can be read with the same call, vroom("xxx.csv").

library(vroom)
data <- vroom("flights.tsv")
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

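This prints a long message describing the type of each column. If you don't need it, it can be silenced by passing the column specification back in through col_types.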

# Reuse the column specification from the first read to silence the message
s <- spec(data)
data <- vroom("flights.tsv", col_types = s)
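As a small self-contained check of the delimiter detection described above, the sketch below writes the same two rows once as CSV and once as TSV to temporary files and reads both with an identical call. The file contents and column names are made up for illustration. `show_col_types = FALSE` is another way to silence the column message; it assumes a reasonably recent vroom release.

```r
library(vroom)

# Toy data written twice: comma-separated and tab-separated (made-up values)
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("carrier,flight,dep_delay", "UA,1545,2", "AA,1141,-4"), csv_path)
writeLines(c("carrier\tflight\tdep_delay", "UA\t1545\t2", "AA\t1141\t-4"), tsv_path)

# The same call handles both files; the delimiter is guessed from the content
from_csv <- vroom(csv_path, show_col_types = FALSE)
from_tsv <- vroom(tsv_path, show_col_types = FALSE)

# Both reads should give the same column names and the same values
identical(names(from_csv), names(from_tsv))
all(from_csv$flight == from_tsv$flight)
```

If the guess ever goes wrong on a real file, the delimiter can be given explicitly with the `delim` argument.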

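2. Reading multiple files at once

Batch reading is one of vroom's highlights: pass a vector of file paths and the files, which must share the same columns, are read and combined into a single table.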

files <- fs::dir_ls(glob = "flights_*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv
#> flights_YV.tsv

data <- vroom(files)
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
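A minimal sketch of the same idea that runs without the flights files: it creates three tiny per-carrier files in a temporary directory and reads them in one call. The data and the `out_dir` and `source` names are made up for illustration, and `list.files()` stands in for `fs::dir_ls()` only to avoid an extra dependency. The `id` argument adds a column recording which file each row came from.

```r
library(vroom)

# Write three small per-carrier files with identical columns (made-up values)
out_dir <- tempfile("flights_")
dir.create(out_dir)
for (carrier in c("AA", "DL", "UA")) {
  lines <- c("carrier\tflight", paste(carrier, c(1, 2), sep = "\t"))
  writeLines(lines, file.path(out_dir, paste0("flights_", carrier, ".tsv")))
}

# Base R stand-in for fs::dir_ls(glob = "flights_*tsv")
files <- list.files(out_dir, pattern = "^flights_.*tsv$", full.names = TRUE)

# Passing the whole vector reads every file and stacks the rows;
# `id = "source"` records the originating file path for each row
data <- vroom(files, id = "source", show_col_types = FALSE)
nrow(data)
unique(basename(data$source))
```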

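3. Reading and writing compressed files

  • vroom_write() can write compressed files directly; the compression is chosen from the file extension (here .gz)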

# `flights` is assumed to be a data frame already in memory, e.g. nycflights13::flights
vroom_write(flights, "flights.tsv.gz")

# Check file sizes to show file is compressed
fs::file_size(c("flights.tsv", "flights.tsv.gz"))
#> 29.62M 7.87M

# Read the file back in
data <- vroom("flights.tsv.gz")
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
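The round trip can be tried without the flights data. The sketch below uses a made-up data frame, writes it once as plain text and once gzip-compressed, compares the sizes with base R's `file.size()` (instead of fs), and reads the compressed file back. The variable names are illustrative only.

```r
library(vroom)

# A made-up data frame, large enough for compression to make a difference
df <- data.frame(id = 1:10000, group = rep(c("a", "b"), times = 5000))

plain_path <- tempfile(fileext = ".tsv")
gz_path <- tempfile(fileext = ".tsv.gz")

vroom_write(df, plain_path)  # plain tab-separated text
vroom_write(df, gz_path)     # gzip-compressed because of the ".gz" extension

# Sizes on disk in bytes: plain first, compressed second
file.size(c(plain_path, gz_path))

# Reading the compressed file back needs no special argument
back <- vroom(gz_path, show_col_types = FALSE)
dim(back)
```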

4. Reading files from the web

file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
data <- vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

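5. Reading from and writing to pipe connections

This one feels a bit like magic: shell commands can filter the data on its way in or out, and for this kind of job it can replace Perl entirely.

  • Extract the United Airlines rows (lines containing UA as a whole word)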

# Return only flights on United Airlines
data <- vroom(pipe("grep -w UA flights.tsv"), col_names = names(flights))
#> Observations: 58,665
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

  • Alternatively, hand the compression to an external tool such as pigz when writing a compressed file

bench::workout({
  vroom_write(flights, "flights.tsv.gz")
  vroom_write(flights, pipe("pigz > flights.tsv.gz"))
})
#> # A tibble: 2 x 3
#>   exprs                                                process     real
#>   <bch:expr>                                          <bch:tm> <bch:tm>
#> 1 vroom_write(flights, "flights.tsv.gz")                  3.5s    2.69s
#> 2 vroom_write(flights, pipe("pigz > flights.tsv.gz"))    1.54s 975.09ms
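One detail of the grep example above is easy to miss: grep also filters out the header line, which is why the column names have to be passed in through col_names. The sketch below shows this on a made-up three-row file. It assumes a Unix-like system with grep on the PATH, and the file contents and column names are illustrative only.

```r
library(vroom)

# Toy TSV with a header row (made-up values)
path <- tempfile(fileext = ".tsv")
writeLines(
  c("carrier\tflight\tdest",
    "UA\t1545\tIAH",
    "AA\t1141\tMIA",
    "UA\t1714\tORD"),
  path
)

# grep keeps only lines containing the whole word "UA", so the header line
# is dropped too; the column names therefore have to be supplied explicitly
cmd <- paste("grep -w UA", shQuote(path))
ua <- vroom(
  pipe(cmd),
  delim = "\t",
  col_names = c("carrier", "flight", "dest"),
  show_col_types = FALSE
)
ua
```

Quoting the path with `shQuote()` keeps the command safe if the temporary path contains spaces.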

6. Selecting columns

  • Keep only the specified columns

data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

  • Drop the specified columns

data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

  • Rename a column while reading

data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

data
#> # A tibble: 336,776 x 19
#>    plane  year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr> <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 N142…  2013     1     1      517            515         2      830
#>  2 N242…  2013     1     1      533            529         4      850
#>  3 N619…  2013     1     1      542            540         2      923
#>  4 N804…  2013     1     1      544            545        -1     1004
#>  5 N668…  2013     1     1      554            600        -6      812
#>  6 N394…  2013     1     1      554            558        -4      740
#>  7 N516…  2013     1     1      555            600        -5      913
#>  8 N829…  2013     1     1      557            600        -3      709
#>  9 N593…  2013     1     1      557            600        -3      838
#> 10 N3AL…  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
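The three forms of col_select can be compared side by side on a small made-up file. This is only a sketch: the file contents are invented, and the result names `kept`, `dropped` and `renamed` are illustrative.

```r
library(vroom)

# Toy TSV with four columns (made-up values)
path <- tempfile(fileext = ".tsv")
writeLines(
  c("year\tflight\ttailnum\tdep_time",
    "2013\t1545\tN14228\t517",
    "2013\t1714\tN24211\t533"),
  path
)

# Keep only two columns
kept <- vroom(path, col_select = c(year, flight), show_col_types = FALSE)

# Drop one column and keep the rest
dropped <- vroom(path, col_select = c(-dep_time), show_col_types = FALSE)

# Rename tailnum to plane and keep everything else
renamed <- vroom(path, col_select = list(plane = tailnum, everything()),
                 show_col_types = FALSE)

names(kept)
names(dropped)
names(renamed)
```

As in the flights output above, the renamed column moves to the front because it is selected first and `everything()` then appends the remaining columns.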

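7. Changing column types

In most cases vroom guesses the column types correctly, but it occasionally gets one wrong, and then the type can be specified by hand. It can also be fixed afterwards with dplyr, though that is slightly more work.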

Type reference; the character in [ ] is the abbreviation used in col_types.

  • col_logical() [l]: containing only T, F, TRUE, FALSE, 1 or 0.
  • col_integer() [i]: integer values.
  • col_double() [d]: floating point values.
  • col_number() [n]: numbers containing the grouping_mark.
  • col_date(format = "") [D]: with the locale's date_format.
  • col_time(format = "") [t]: with the locale's time_format.
  • col_datetime(format = "") [T]: ISO8601 date times.
  • col_factor(levels, ordered) [f]: a fixed set of values.
  • col_character() [c]: everything else.
  • col_skip() [_ or -]: don't import this column.
  • col_guess() [?]: parse using the "best" type based on the input.

Usage examples:

# read the 'year' column as an integer
data <- vroom("flights.tsv", col_types = c(year = "i"))

# also skip reading the 'time_hour' column
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_"))

# also read the carrier as a factor
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_", carrier = "f"))

# the same specification written with the col_*() functions
data <- vroom(
  "flights.tsv",
  col_types = list(
    year = col_integer(),
    time_hour = col_skip(),
    carrier = col_factor()
  )
)
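To see the effect of these abbreviations without the flights file, the sketch below applies the same three overrides to a tiny made-up TSV and inspects the resulting classes. The file contents are invented. `delim` is set explicitly here as a precaution, because the date-time values contain spaces and colons that could confuse delimiter guessing on such a small file.

```r
library(vroom)

# Toy TSV (made-up values) with a column of each kind we want to override
path <- tempfile(fileext = ".tsv")
writeLines(
  c("year\tcarrier\ttime_hour",
    "2013\tUA\t2013-01-01 05:00:00",
    "2013\tAA\t2013-01-01 05:00:00"),
  path
)

# year as integer, carrier as factor, time_hour skipped entirely
data <- vroom(
  path,
  delim = "\t",
  col_types = c(year = "i", carrier = "f", time_hour = "_")
)

# Inspect the classes of the columns that were kept
sapply(data, class)
```

Supplying col_types also quiets the column message, as the message itself suggests.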

8. Read speed

In a word: fast! It is a very good fit for machine learning work, where datasets routinely run to several gigabytes.

[Figure: comparison of the time each package takes to read and write 1.55 GB of data.]
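For a rough impression on your own machine, a timing sketch along the following lines can be used. It is not the benchmark behind the comparison above: the data are randomly generated, the table is far smaller than 1.55 GB, only base R's `read.delim()` is used as a comparison, and the timings will vary from run to run. Because vroom may read some columns lazily, the last step touches every column so that the values really are loaded.

```r
library(vroom)

# Generate a made-up table and write it out as TSV
n <- 200000
df <- data.frame(
  id = seq_len(n),
  value = runif(n),
  label = sample(c("a", "b", "c"), n, replace = TRUE)
)
path <- tempfile(fileext = ".tsv")
vroom_write(df, path)

# Time base R and vroom on the same file
t_base <- system.time(base_df <- read.delim(path))
t_vroom <- system.time(vroom_df <- vroom(path, show_col_types = FALSE))

# Touch every column so that any lazily read values are materialized
t_touch <- system.time(for (col in vroom_df) length(unique(col)))

# Elapsed seconds for each step
rbind(base = t_base, vroom = t_vroom, vroom_touch = t_touch)[, "elapsed"]
```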






