rx_summary

Usage

revoscalepy.rx_summary(formula: str, data, by_group_out_file=None,
    summary_stats: list = None, by_term: bool = True, pweights=None,
    fweights=None, row_selection: str = None, transforms=None,
    transform_objects=None, transform_function=None, transform_variables=None,
    transform_packages=None, transform_environment=None,
    overwrite: bool = False, use_sparse_cube: bool = None,
    remove_zero_counts: bool = None, blocks_per_read: int = None,
    rows_per_block: int = 100000, report_progress: int = None,
    verbose: int = 0, compute_context=None, **kwargs)

Description

Produce univariate summaries of objects in revoscalepy.

Arguments

formula

Statistical model using symbolic formulas. The formula typically does not contain a response variable, i.e. it should be of the form ~ terms.

data

either a data source object, a character string specifying a ‘.xdf’ file, or a data frame object to summarize. If a Spark compute context is being used, this argument may also be an RxHiveData, RxOrcData, RxParquetData or RxSparkDataFrame object or a Spark data frame object from pyspark.sql.DataFrame.

by_group_out_file

None, a character string or vector of character strings specifying .xdf file names(s), or an RxXdfData object or list of RxXdfData objects. If not None, and the formula includes computations by factor, the by-group summary results will be written out to one or more ‘.xdf’ files. If more than one .xdf file is created and a single character string is specified, an integer will be appended to the base by_group_out_file name for additional file names. The resulting RxXdfData objects will be listed in the categorical component of the output object.

summary_stats

A list of strings containing one or more of the following values: “Mean”, “StdDev”, “Min”, “Max”, “ValidObs”, “MissingObs”, “Sum”.

by_term

bool variable. If True, missings will be removed by term (by variable or by interaction expression) before computing summary statistics. If False, observations with missings in any term will be removed before computations.

pweights

Character string specifying the variable to use as probability weights for the observations.

fweights

Character string specifying the variable to use as frequency weights for the observations.

row_selection

None. Not currently supported, reserved for future use.

transforms

None. Not currently supported, reserved for future use.

transform_objects

None. Not currently supported, reserved for future use.

transform_function

Variable transformation function.

transform_variables

List of strings of input data set variables needed for the transformation function.

transform_packages

None. Not currently supported, reserved for future use.

transform_environment

None. Not currently supported, reserved for future use.

overwrite

Bool value. If True, an existing byGroupOutFile will be overwritten. overwrite is ignored byGroupOutFile is None.

use_sparse_cube

Bool value. If True, sparse cube is used.

remove_zero_counts

Bool flag. If True, rows with no observations will be removed from the output for counts of categorical data. By default, it has the same value as useSparseCube. For large summary computation, this should be set to True, otherwise the Python interpreter may run out of memory even if the internal C++ computation succeeds.

blocks_per_read

Number of blocks to read for each chunk of data read from the data source.

rows_per_block

Maximum number of rows to write to each block in the by_group_out_file (if it is not None).

report_progress

Integer value with options: 0: No progress is reported. 1: The number of processed rows is printed and updated. 2: Rows processed and timings are reported. 3: Rows processed and all timings are reported.

verbose

Integer value. If 0, no additional output is printed. If 1, additional summary information is printed.

compute_context

A valid RxComputeContext object.

kwargs

Additional arguments to be passed directly to the Revolution Compute Engine.

Returns

An RxSummary object containing the following elements: nobs.valid: Number of valid observations. nobs.missing: Number of missing observations. sDataFrame: Data frame containing summaries for continuous variables. categorical: List of summaries for categorical variables. categorical.type: Types of categorical summaries: can be “counts”, or “cube” (for crosstab counts) or “none” (if there is no categorical summaries). formula: Formula used to obtain the summary.

Example

import os
from revoscalepy import rx_summary, RxOptions, RxXdfData
sample_data_path = RxOptions.get_option("sampleDataDir")
ds = RxXdfData(os.path.join(sample_data_path, "AirlineDemoSmall.xdf"))
summary = rx_summary("ArrDelay+DayOfWeek", ds)
print(summary)