RxXdfData: Generate Xdf Data Source Object

Description

This is the main generator for S4 class RxXdfData, which extends RxDataSource.

Usage

  RxXdfData(file, varsToKeep = NULL, varsToDrop = NULL, returnDataFrame = TRUE,
          stringsAsFactors = FALSE, blocksPerRead = rxGetOption("blocksPerRead"),
          fileSystem = NULL, createCompositeSet = NULL, createPartitionSet = NULL,
          blocksPerCompositeFile = 3) 

  ## S3 method for class 'RxXdfData'
  head(x, n = 6L, reportProgress = 0L, ...)

  ## S3 method for class 'RxXdfData'
  summary(object, ...)

  ## S3 method for class 'RxXdfData'
  tail(x, n = 6L, addrownums = TRUE, reportProgress = 0L, ...)


Arguments

file

character string specifying the location of the data. For single Xdf, it is a .xdf file. For composite Xdf, it is a directory like /tmp/airline. When using distributed compute contexts like RxSpark, a directory should be used since those compute contexts always use composite Xdf.

varsToKeep

character vector of variable names to keep during operations. If NULL, the argument is ignored. Cannot be used together with varsToDrop.

varsToDrop

character vector of variable names to drop from operations. If NULL, the argument is ignored. Cannot be used together with varsToKeep.
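
As a minimal sketch of the two selection arguments, the following creates one data source that keeps two variables and another that drops one. It assumes the RevoScaleR package is installed and reuses the claims sample data from the Examples section. The variable name age is an assumption about that file; cost and RowNum are taken from the Examples section.

```r
# Variable selections used below; "age" is assumed to exist in the sample data.
keepVars <- c("age", "cost")
dropVars <- "RowNum"

# Guarded so the script does nothing where RevoScaleR is not installed.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  claimsPath <- file.path(rxGetOption("sampleDataDir"), "claims")

  # Supply only one of varsToKeep / varsToDrop per data source.
  keepSource <- RxXdfData(claimsPath, varsToKeep = keepVars)
  dropSource <- RxXdfData(claimsPath, varsToDrop = dropVars)

  # Inspect the first rows seen through each data source.
  print(head(keepSource, n = 3))
  print(head(dropSource, n = 3))
}
```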

returnDataFrame

logical indicating whether or not to convert the result to a data frame when reading with rxReadNext. If FALSE, a list is returned when reading with rxReadNext.
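
The effect of returnDataFrame can be sketched by reading a single chunk by hand. This is an illustration only, assuming RevoScaleR is installed and the claims sample data is available; rxClose is the counterpart of rxOpen, and asDataFrame is a name introduced here for the example.

```r
# Container type requested from rxReadNext: FALSE asks for a list.
asDataFrame <- FALSE

# Guarded so the script does nothing where RevoScaleR is not installed.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  src <- RxXdfData(file.path(rxGetOption("sampleDataDir"), "claims"),
                   returnDataFrame = asDataFrame)

  rxOpen(src)               # open the data source for reading
  chunk <- rxReadNext(src)  # read the next chunk of data
  rxClose(src)              # close the data source again

  # With returnDataFrame = FALSE the chunk is expected to be a plain list.
  print(class(chunk))
}
```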

stringsAsFactors

logical indicating whether or not to convert strings into factors in R (for reader mode only). It currently has no effect.

blocksPerRead

number of blocks to read for each chunk of data read from the data source.

fileSystem

character string or RxFileSystem object indicating the type of file system; "native" or an RxNativeFileSystem object can be used for the local operating system, or an RxHdfsFileSystem object for the Hadoop file system. If NULL, the file system will be set to that of the current compute context, if available, otherwise to the value of the fileSystem option.

createCompositeSet

logical value or NULL. Used only when writing. If TRUE, a composite set of files will be created instead of a single .xdf file. Subdirectories data and metadata will be created. In the data subdirectory, the data will be split across a set of .xdfd files (see blocksPerCompositeFile below for determining how many blocks of data will be in each file). In the metadata subdirectory there is a single .xdfm file, which contains the metadata for all of the .xdfd files in the data subdirectory. When the compute context is RxHadoopMR or RxSpark, a composite set of files is always created.

createPartitionSet

logical value or NULL. Used only when writing. If TRUE, a set of files for partitioned Xdf will be created when this RxXdfData object is assigned to outData of rxPartition. Subdirectories data and metadata will be created. In the data subdirectory, the data will be split across a set of .xdf files (each file stores the data of a single data partition; see rxPartition for details). In the metadata subdirectory there is a single .xdfp file, which contains the metadata for all of the .xdf files in the data subdirectory. The partitioned Xdf object is currently supported only in rxPartition and rxGetPartitions.

blocksPerCompositeFile

integer value. If createCompositeSet=TRUE and the compute context is not RxHadoopMR, this is the number of blocks put into each .xdfd file in the composite set. When importing on Hadoop using MapReduce, the number of rows per .xdfd file is determined by the rows assigned to each MapReduce task, and the number of blocks per .xdfd file is therefore determined by rowsPerRead.
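
A rough sketch of writing a composite set in a local compute context follows. It assumes RevoScaleR is installed and uses rxDataStep as the writer, which is one possible choice and not prescribed by this page. The output directory claimsComposite and the names outDir and blocksPerFile are hypothetical.

```r
# Hypothetical output directory for the composite set, under the R temp dir.
outDir <- file.path(tempdir(), "claimsComposite")
blocksPerFile <- 2L

# Guarded so the script does nothing where RevoScaleR is not installed.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  inSource <- RxXdfData(file.path(rxGetOption("sampleDataDir"), "claims"))

  # The output data source points at a directory, not a single .xdf file.
  outSource <- RxXdfData(outDir, createCompositeSet = TRUE,
                         blocksPerCompositeFile = blocksPerFile)

  # Copy the data; rxDataStep is assumed here as the writing function.
  rxDataStep(inData = inSource, outFile = outSource, overwrite = TRUE)

  # The listing should show .xdfd files under data and an .xdfm file
  # under metadata, as described for createCompositeSet.
  print(list.files(outDir, recursive = TRUE))
}
```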

x

an RxXdfData object

object

an RxXdfData object

n

positive integer. Number of rows of the data set to extract.

addrownums

logical. If TRUE, row numbers will be created to match the original data set.

reportProgress

integer value with options:

  • 0: no progress is reported.
  • 1: the number of processed rows is printed and updated.
  • 2: rows processed and timings are reported.
  • 3: rows processed and all timings are reported.
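
The head and tail methods and their n, addrownums, and reportProgress arguments might be used as in the sketch below. It assumes RevoScaleR is installed and the claims sample data is available; rowsToShow is a name introduced here for the example.

```r
# Number of rows to extract from each end of the data set.
rowsToShow <- 3L

# Guarded so the script does nothing where RevoScaleR is not installed.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  src <- RxXdfData(file.path(rxGetOption("sampleDataDir"), "claims"))

  # First rows, with no progress reporting (the default level 0).
  print(head(src, n = rowsToShow))

  # Last rows, keeping the original row numbers and reporting
  # the number of processed rows (level 1).
  print(tail(src, n = rowsToShow, addrownums = TRUE, reportProgress = 1L))
}
```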

...

arguments to be passed to underlying functions

Value

object of class RxXdfData.

Author(s)

Microsoft Corporation Microsoft Technical Support

See Also

RxXdfData-class, rxNewDataSource, rxOpen, rxReadNext.

Examples


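 # Create a data source for the claims file in the sample data directory
 myDataSource <- RxXdfData(file.path(rxGetOption("sampleDataDir"), "claims"))
 # both of these should return TRUE
 is(myDataSource, "RxXdfData")
 is(myDataSource, "RxDataSource")

 # List the variable names in the data source
 names(myDataSource)

 # Build a model formula with cost as the dependent variable, excluding RowNum
 modelFormula <- formula(myDataSource, depVars = "cost", varsToDrop = "RowNum")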