Input and Output in R
Posted on Nov 09, 2012 in Programming
Things under legendu.net/outdated are outdated technologies that the author does not plan to update any more. Please look for better alternatives.
read.table, read.csv, read.delim, etc.
-
It is suggested that you use
read.csv/read.delimand avoid usingread.table. This is because that thecomment.charis off forread.csv/read.delimand on forread.tableby default. My experience is that data seldom has comments in them and people are usually not aware thatcomment.charis on forread.tableby default. When there are "comment chararacters" in data which are not intended to be comment characters, you screw up if you read data usingread.tableand forget to turn off comment usingcomment.char = ''. -
By default, lines starting with
#are treated as comment lines and thus are ignored when reading data usingread.table, etc. If a line is not a comment line, avoid starting it with#. Or you can specify a different comment character (e.g.,$using optioncomment.char = "$"). -
It is suggested that you always use
stringsAsFactors = FALSEwhen constructing a data frame (data.frame, as.data.frame, read.table, read.csv, etc.). Factors causes more troubles than conveniences. If you do need factors (for building models), just manual convert columns to factors. -
If a data row have an extra column/field (e.g., due to missing quotes in CSV format), R will not throw error but instead treats the data to have a row names column. This results in shifted columns in some (shorter) rows. You can examining the format of a data file using the function
count.fieldswhich counts the number of fields in each line. -
read.csvfor standard CSV files whileread.csv2is for European CSV format.
write.table, write.csv, write.delim, etc.
-
It is suggested that you write data frames into CSV format. CSV format has a clear definition. If you tell people that a file is in CSV format, they know how to read it. But if you tell people that a file is in a general delimited format, people have to ask about what the delimiter is and whether fields are quoted, etc.
-
The function
write.tablewrites a data frame or a matrix into a file. Note that it can also append data into a file.
x = matrix(1:24, nrow = 6)
colnames(x) = paste0("x", 1:4)
write.table(x, 'out.txt')
x[, ] = 0
write.table(x, 'out.txt', append = T, col.names = F)
-
By default
write.tableoutput missing values as character NA (na = "NA"). It is suggested that you output missing values as empty strings (na = "") as it is more portable. Other programming languages do not recognize characterNA'sas missing values. -
Always quote fields when you write data into CSV format.
-
row.nameshave different means inread.tableandwrite.table.row.namesis a logical variable indicating whether you want to output row names or not inwrite.table, however, it is not a logical variable indicating whether there is a row names in the data. It is much more complicated. Please refer to the R help doc for detailed explanation. -
It is suggested that you never write row names into files. If row names contains useful informtion, write row names into file as a column/field. The reason is that row names causes troubles in IO. On one hand, it is not convenient to read in a file with row names into other programming language. On another hand, row names causes side effect in R too.
MS Excel
http://www.thertrader.com/2014/02/11/a-million-ways-to-connect-r-and-excel/ It is suggested that you avoid using Excel as input/ouput data format. CSV is a better alternative. However, if you do have to use Excel as input/output data format, read the following tips.
-
There are lots of ways to read data from and write data into Excel documents. For example, packages
xlsReadWrite,xlsx,RODBC(and many more) all offers ways to import data from and export data to Excel documents.xlsx(which offersread.xlsxandwrite.xlsx) is good package for dealing with Excel spreadsheet.RODBCis a universal way to deal with all kinds of databases (not just Excel spreadsheet). -
It is usually very slow to read in (or write to) a large Excel spreadsheet. It is suggested that you convert large Excel spreadsheets to CSV files first and then read in them. Also, write data into CSV files instead of Excel spreadsheet. However, be careful that the stupid Excel might loss information when converting to CSV format. Generally speaking, this happens when there are very long numbers.
-
Excel might scilently format opened CSV (or imported text) files. Generally speaking, this happens when there are very long numbers in the text file. Be careful not to introduce undesired changed. If you just open a CSV file to view it and do not want change its content, then just discard any changes Excel has made. If you indeed want to change CSV file in Excel, keep your fingers cross.
Binary Data
- The function
readBinreads in binary data and the functionwriteBinwrites binary data into files. Both of the two functions have limits on the size of data that they can deal at a time. For large binary data, you have to usereadBin/writeBinmultiple times to read/write them from/to files.readBinandwriteBinare compatible withfwriteandfreadin MATLAB.freadandfwriteare C style functions, ... (to be checked for compatible with c/c++). However,readBinandwriteBinis not compatible with Java classesDataOutputStreamandDataInputStream. To read binary data written usingDataOutputStreamfrom Java, you can call Java code for reading binary data usingrJavain R; to write binary data that is recognized byDataInputStreamin Java, you can call Java code for writing binary data usingrJavain R.
Misc
- All most all input/output functions in R support reading data from all kinds of source including files, console, clipboard and website. For example, if you have copied a block of data from an Excel document, you can read it into R using the following command.
x = read.table('clipboard', sep = '\t')
This is good and quick way to import part of the data from a Excel document into in R.
You do not have to worry about formula in cells.
When you read data from Excel, values (instead formulas) are read in.
If the data is on a website,
you can just pass the URL to read.table (or read.delim, read.csv, etc.) to read it.
-
printprints an R object to the R console andcatcan print multiple objects to the R console, clipboard or files. There is some difference between the outputs generated by these two functions. Generally speaking, if you want to see the content of an object in R scenario (keep special characters as they are), you want to useprint; if you want to see the content in human readable format (special characters are translated), you want to use functioncat. I have to mention thatcatdoesn't work for all types of R objects (e.g. objects ofxtable). -
The function
scanis extremely powerful and flexible. You can skip firstklines using the optionskip = k. For example, the following command skips the first 5 lines.
scan(what = double(), skip = 5)
You can skip lines starting with special characters using the option comment.char.
For example,
the following command skips lines starting with "#".
scan(what = double(), comment.char = "#")
You can decide which special strings are treated as missing values using the option na.string.
For example,
the following command treats "NA" as missing values.
scan(what = double(), na.string = "NA")
You can limit the number of data values/lines to be read in, etc.
Many input functions in R are based on the function scan,
for example, read.table.
These functions inherits the powerfulness and flexibility of scan.
scan is also helpful for interactive programming.
For example, you can read a string into the variable input using the following commmand.
scan(what = character(), n = 1) -> input
However, scan is used to read in a vector,
so the data to be read in at the same time must be of the same type.