Preface
Data Reduction Estimator is a proprietary Zadara tool that helps Zadara's customers and partners estimate the data reduction ratio an All-Flash VPSA may achieve on a given data set. It is recommended to run it on existing (pre-AFA) VPSA data sets to gauge the potential data reduction before migrating the data set to an AFA VPSA.
Note: This tool is an estimator based on mathematical and statistical calculations. It provides good estimates in most cases, but cannot guarantee actual data reduction results, so treat its output as an indicator.
Requirements
- OS:
- Ubuntu 16.04 and later
- CentOS 7.5
- Windows Server 2016/2019
- Data types:
- Block/NAS (both NFS/SMB)
- Hypervisor datastores (e.g. an ESXi datastore) should be attached as read-only block volumes to the Linux system that will run the tool.
- In all cases, read-only access is required.
- Memory:
- The recommended memory for running the estimator depends on the size of the data set and the number of files. To estimate the required memory, use the following guidelines:
- 4MB per 1GB NAS (assuming average file size of 128KiB)
- 200KB per 1GB for SAN
- By default, the tool will not use more than 80% of the available physical memory, although this can be overridden from the command line.
- Run duration:
- The tool will run until it completes scanning and analyzing the data set(s).
- Scanning time is limited to 1 hour (default).
- The tool can be stopped at any time. The data reduction estimate is 0 during the scanning phase. During the analyzing phase the estimation accuracy improves: the longer the tool runs, the more accurate the estimate will be.
- IO load on the source volume:
You can control the IO load the tool places on the source data Volume by selecting the number of concurrent threads (--threads) that read from the Volume.
By default the tool uses 2 x <number_of_cpus> concurrent threads. Increasing the number of threads makes the tool faster while also increasing the read workload.
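As a rough sketch of the memory guidelines above (4MB per 1GB for NAS, 200KB per 1GB for SAN), a small helper can ballpark the memory required before a run. The function name and interface here are illustrative, not part of the tool:

```shell
#!/bin/sh
# Ballpark the estimator's memory requirement, in MB, from the
# guidelines above. Usage: estimate_mem_mb <nas|san> <dataset_size_gb>
estimate_mem_mb() {
    kind=$1
    size_gb=$2
    case "$kind" in
        nas) echo $(( size_gb * 4 )) ;;          # 4MB per 1GB of NAS data
        san) echo $(( size_gb * 200 / 1024 )) ;; # 200KB per 1GB of SAN data
        *)   echo "unknown data type: $kind" >&2; return 1 ;;
    esac
}

# e.g. a 10 TiB NAS share (10240 GB) needs roughly 40 GiB of memory:
estimate_mem_mb nas 10240    # prints 40960 (MB)
```

Compare the result against the machine's physical memory, keeping in mind the 80% default cap described above.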
How it works
The technique used to estimate the data reduction ratio is based on breakthrough theoretical work by Valiant and Valiant in a 2011 paper entitled "Estimating the unseen", which gives a provably accurate method for estimating various measures of a data set while seeing only a fraction of it.
The tool works in two phases:
- Scanning - during this phase sampled data is read. Since no analysis is done, there is no estimation output. Scanning time can be limited (defaults to 1 hour).
- Analyzing - Data is analyzed for deduplication, compression and zero pattern matching. Output contains the live analysis progress (completion percentage) and updated data reduction estimate.
Data Reduction estimation guidelines
- The underlying storage pool type impacts the deduplication estimate, so use the matching chunk size. It can be derived from the pool type of the source volume on your storage array.
Guideline for Zadara's VPSA Storage Array:
- Transactional pool - chunk size=8k
- Repository pool - chunk size=16k
- Archival pool - chunk size=32k
- Use the --estimate parameter to select which data reduction technique to estimate: dedupe, compression, or both. Set no-dedupe-zeros to ignore zeros in a block Volume; this provides a data reduction estimate for the actual data.
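The pool-to-chunk-size guideline above can be captured in a small wrapper so the right -c value is always passed; the function name is illustrative:

```shell
# Map a VPSA storage pool type to the matching --chunk_size value,
# per the guideline above.
chunk_size_for_pool() {
    case "$1" in
        transactional) echo 8k  ;;
        repository)    echo 16k ;;
        archival)      echo 32k ;;
        *) echo "unknown pool type: $1" >&2; return 1 ;;
    esac
}

# e.g. for a volume on a transactional pool:
# ./zrest -p /mnt/share -c "$(chunk_size_for_pool transactional)"
```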
Downloading the tool
The "Data Reduction Estimator Tool" is a single-binary, standalone utility available for both Linux and Windows.
It is publicly available:
zrest-linux-1.08.tgz (MD5 Checksum: 864FA5F5A3E4A42D675CE11242DE8A44)
zrest-windows-1.08.zip (MD5 Checksum: 929CE304ED6BA2CF92346A81EFBC3846)
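Before extracting, it is good practice to verify the download against the published MD5 checksum. On Linux this can be done with md5sum (Windows users can use CertUtil -hashfile <file> MD5); the helper below is only a sketch:

```shell
# Verify a downloaded archive against its published MD5 checksum.
# Usage: verify_md5 <file> <expected_md5>
verify_md5() {
    actual=$(md5sum "$1" | awk '{print $1}')
    # The published sums are upper-case while md5sum prints lower-case,
    # so compare the upper-cased forms.
    if [ "$(echo "$actual" | tr 'a-f' 'A-F')" = "$(echo "$2" | tr 'a-f' 'A-F')" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 ($actual)" >&2
        return 1
    fi
}

# e.g.:
# verify_md5 zrest-linux-1.08.tgz 864FA5F5A3E4A42D675CE11242DE8A44
```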
Configuring the tool
NOTE: Windows users should open command prompt with elevated privileges.
The tool can run with different options and parameters; run it with the -h flag to see them:
$ ./zrest -h
Usage:
./zrest
--path | -p <path> [-p <path> ...] One or more input paths.
[--chunk_size | -c <chunk_size>] Chunk size. Default=16.000 KiB.
[--estimate | -e <[no-]dedupe|no-dedupe-zeros|[no-]compress|all>] Estimation mode. Default=all.
[--freq | -r <report_freq>] Report frequency in percents. Default=1.000.
[--threads | -t <#threads>] Number of threads. Default=2*number_of_cpus (4).
[--scan_time | -T <scan_time_limit>] Scan time limit in seconds. Default=3600. Set 0 for no limit.
[--memory | -M <memory_limit>] Memory limit. Default=80% of available memory. (1.287 GiB). Set 0 for no limit.
[--out | -o <outfile>] Output file.
[--help | -h]
[--version | -V]
Size parameters accept suffixes K(KiB), M(MiB), G(GiB), T(TiB).
--path or -p : indicates the path of the required volume; separate paths with spaces to run the tool on multiple volumes.
NOTE: Windows users - the tool can work with logical drive letter or full UNC path.
--chunk_size or -c : compression and dedupe ratios are calculated for this chunk size. The default is 16k (Repository pool); the other options are 8k and 32k.
For example: ./zrest -p /mnt/share -c 8k
--estimate or -e : selects which results to estimate; it is possible to estimate only dedupe or only compression. It is also possible to ignore zeros in the dedupe estimation.
The default is all, meaning both dedupe (with zeros) and compression are calculated.
Dedupe, compression, or zero handling can be disabled with:
no-dedupe | no-compress | no-dedupe-zeros
For example, to start the tool without compression: ./zrest -p /mnt/share -c 8k -e no-compress
--freq or -r : report frequency in percent; the default is 1.
For example, to run the tool with report frequency 5%: ./zrest -p /mnt/share -c 8k -r 5
--threads or -t : number of threads. Default=2*number_of_cpus.
For example, to run the tool with 10 threads: ./zrest -p /mnt/share -c 8k -t 10
--scan_time or -T : scan time limit in seconds. Default=3600. Set 0 for no limit.
The Scanning phase stops when this limit is reached, and the Analyze phase then starts on the data that was scanned. Increasing the scanning and analyzing time increases accuracy.
For example, to limit scanning to 2 hours: ./zrest -p /mnt/share -c 8k -T 7200
--out or -o : output file path; the output will be saved to the chosen file.
For example, to save the tool output to /tmp/out.txt: ./zrest -p /mnt/share -c 8k -o /tmp/out.txt
--memory or -M : limits the memory used by the tool. Default=80% of the available physical memory. Set 0 for no limit.
For example, to limit memory usage to 5GiB: ./zrest -p /mnt/share -c 8k -M 5G
--help or -h : shows all the options and parameters.
--version or -V : shows the version of the tool.
The tool's size parameters accept the suffixes K (KiB), M (MiB), G (GiB), and T (TiB).
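The individual flags above can be combined in a single run. The sketch below builds the full command as a string first, which makes it easy to review or log before launching; every flag is documented above, but the specific paths and values are illustrative:

```shell
# Dedupe-only estimate (no compression) on an 8 KiB-chunk volume, with
# 8 threads, a 2-hour scan limit, a 4 GiB memory cap, a report every 5%,
# and output saved to a file. All values are examples.
ZREST_CMD="./zrest -p /mnt/share -c 8k -e no-compress -r 5 -t 8 -T 7200 -M 4G -o /tmp/zrest-out.txt"
echo "$ZREST_CMD"

# When the command looks right, run it:
# $ZREST_CMD
```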
Running the tool
Please follow the steps below to use the tool:
- Expose/mount the required volume to the server.
- Download the tool to your server and extract it:
- tar zxvf zrest-linux-1.08.tgz
- Run the tool on the required volume.
Example (Linux):
/home/zadara/zrest -p /mnt/smb_share/
Example (Windows):
c:\zrest\zrest.exe -p z:\
- You can stop the tool once you see it converging to a certain dedupe ratio. Typically this happens around 25% to 30% of the run.
Understanding the Output
The output of the tool is shown in 3 sections:
- Input Parameters - prints the parameters that were provided by the user
- Scanning information:
- First row prints the scanned paths, an estimate of the total amount of data and total number of files, and the average file size.
- Second row prints the scan time, the total amount of scanned data, the scanning rate, the total number of scanned files, the reducible* amount of data (which is used during the Analyzing phase), and the memory usage during the scanning phase.
*The reducible amount of data ignores files smaller than the chunk size.
- Analysis information:
- The analyzing time in seconds
- The progress in GiB and percentage
- The analyzing Bandwidth in MiB/s
- The amount of used Memory
- The Dedup Ratio (zeros are not considered if the no-dedupe-zeros option is set)
- The Low/High Dedup Ratio. It takes some time to display an accurate dedup ratio; until then, the tool presents a low-to-high dedup range together with a theoretical DEDUPE value.
- The Compression ratio
- The % of Zeros in the analyzing data
Support and questions
If you encounter any difficulties running the tool or have additional questions, please contact support@zadarastorage.com.