Preface
Data Reduction Estimator is a proprietary Zadara tool that helps Zadara's customers and partners estimate the data reduction ratio an All-Flash VPSA may achieve on a given data set. It is recommended to run it on existing (pre-AFA) VPSA data sets to gauge the potential data reduction before migrating the data set to an AFA VPSA.
Note: This tool is an estimator based on mathematical and statistical calculations. It provides good estimates in most cases, but cannot guarantee actual data reduction results, so treat its output as an indicator.
Requirements
- OS:
- Ubuntu 16.04 and later
- CentOS 7.5
- Windows Server 2016/2019
- Data types:
- Block/NAS (both NFS/SMB)
- Hypervisor datastores (e.g. an ESXi datastore) should be attached as read-only block volumes to the Linux system that will run the tool.
- In all cases, read-only access is required.
- Memory:
- The recommended memory for running the estimator depends on the size of the data set and the number of files. To estimate the required memory, use the following guidelines:
- 4MB per 1GB NAS (assuming average file size of 128KiB)
- 200KB per 1GB for SAN
- By default, the tool will not use more than 80% of the available physical memory, although this can be overridden from the command line.
- Run duration:
- The tool will run until it completes scanning and analyzing the data set(s).
- Scanning time is limited to 1 hour (default).
- The tool can be stopped at any time. The data reduction estimate is 0 during the scanning phase. During the analyzing phase the estimation accuracy improves: the longer the tool runs, the more accurate the estimate will be.
- IO load on the source volume:
You can control the IO load the tool places on the source data Volume by selecting the number of concurrent threads (--threads) that read from the Volume.
By default the tool uses 2 x <number_of_cpus> concurrent threads. Increasing the number of threads makes the tool faster while also increasing the read workload.
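As a rough sketch of the memory guidelines above (4MB per 1GB for NAS, 200KB per 1GB for SAN), a small helper can ballpark the memory required before a run. The function name and interface here are illustrative, not part of the tool:

```shell
#!/bin/sh
# Ballpark the estimator's memory requirement, in MB, from the
# guidelines above. Usage: estimate_mem_mb <nas|san> <dataset_size_gb>
estimate_mem_mb() {
    kind=$1
    size_gb=$2
    case "$kind" in
        nas) echo $(( size_gb * 4 )) ;;          # 4MB per 1GB of NAS data
        san) echo $(( size_gb * 200 / 1024 )) ;; # 200KB per 1GB of SAN data
        *)   echo "unknown data type: $kind" >&2; return 1 ;;
    esac
}

# e.g. a 10 TiB NAS share (10240 GB) needs roughly 40 GiB of memory:
estimate_mem_mb nas 10240    # prints 40960 (MB)
```

Compare the result against the machine's physical memory, keeping in mind the 80% default cap described above.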
How it works
The technique used to estimate the data reduction ratio is based on breakthrough theoretical work by Valiant and Valiant in a 2011 paper entitled "Estimating the unseen", which gives a provably accurate method for estimating various measures of a data set while seeing only a fraction of it.
The tool works in two phases:
- Scanning - during this phase sampled data is read. Since no analysis is done, there is no estimation output. Scanning time can be limited (defaults to 1 hour).
- Analyzing - Data is analyzed for deduplication, compression and zero pattern matching. Output contains the live analysis progress (completion percentage) and updated data reduction estimate.
Data Reduction estimation guidelines
- The underlying storage pool type impacts the deduplication estimate, so use the matching chunk size. It can be derived from the pool type of the source volume on your storage array.
Guideline for Zadara's VPSA Storage Array:
- Transactional pool - chunk size=8k
- Repository pool - chunk size=16k
- Archival pool - chunk size=32k
- Use the --estimate parameter to select which data reduction technique to estimate: dedupe, compression, or both. Set no-dedupe-zeros to ignore zeros in a block Volume; this provides a data reduction estimate for the actual data.
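The pool-to-chunk-size guideline above can be captured in a small wrapper so the right -c value is always passed; the function name is illustrative:

```shell
# Map a VPSA storage pool type to the matching --chunk_size value,
# per the guideline above.
chunk_size_for_pool() {
    case "$1" in
        transactional) echo 8k  ;;
        repository)    echo 16k ;;
        archival)      echo 32k ;;
        *) echo "unknown pool type: $1" >&2; return 1 ;;
    esac
}

# e.g. for a volume on a transactional pool:
# ./zrest -p /mnt/share -c "$(chunk_size_for_pool transactional)"
```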
Downloading the tool
The "Data Reduction Estimator Tool" is a single-binary, standalone utility available for both Linux and Windows.
It is publicly available:
zrest-linux-1.08.tgz (MD5 Checksum: 864FA5F5A3E4A42D675CE11242DE8A44)
zrest-windows-1.08.zip (MD5 Checksum: 929CE304ED6BA2CF92346A81EFBC3846)
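Before extracting, it is good practice to verify the download against the published MD5 checksum. On Linux this can be done with md5sum (Windows users can use CertUtil -hashfile <file> MD5); the helper below is only a sketch:

```shell
# Verify a downloaded archive against its published MD5 checksum.
# Usage: verify_md5 <file> <expected_md5>
verify_md5() {
    actual=$(md5sum "$1" | awk '{print $1}')
    # The published sums are upper-case while md5sum prints lower-case,
    # so compare the upper-cased forms.
    if [ "$(echo "$actual" | tr 'a-f' 'A-F')" = "$(echo "$2" | tr 'a-f' 'A-F')" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 ($actual)" >&2
        return 1
    fi
}

# e.g.:
# verify_md5 zrest-linux-1.08.tgz 864FA5F5A3E4A42D675CE11242DE8A44
```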
Configuring the tool
NOTE: Windows users should open command prompt with elevated privileges.
The tool can run with different options and parameters; run it with the -h flag to see them:
$ ./zrest -h
Usage:
./zrest
--path | -p <path> [-p <path> ...] One or more input paths.
[--chunk_size | -c <chunk_size>] Chunk size. Default=16.000 KiB.
[--estimate | -e <[no-]dedupe|no-dedupe-zeros|[no-]compress|all>] Estimation mode. Default=all.
[--freq | -r <report_freq>] Report frequency in percents. Default=1.000.
[--threads | -t <#threads>] Number of threads. Default=2*number_of_cpus (4).
[--scan_time | -T <scan_time_limit>] Scan time limit in seconds. Default=3600. Set 0 for no limit.
[--memory | -M <memory_limit>] Memory limit. Default=80% of available memory. (1.287 GiB). Set 0 for no limit.
[--out | -o <outfile>] Output file.
[--help | -h]
[--version | -V]
Size parameters accept suffixes K(KiB), M(MiB), G(GiB), T(TiB).
--path or -p : indicates the path of the required volume; separate paths with spaces to run the tool on multiple volumes.
NOTE: Windows users - the tool can work with logical drive letter or full UNC path.
--chunk_size or -c : compression and dedupe ratios are calculated for this chunk size. The default is 16k (Repository pool); the other options are 8k and 32k.
For example: ./zrest -p /mnt/share -c 8k
--estimate or -e : selects which results to estimate; it is possible to estimate only dedupe or only compression. It is also possible to ignore zeros in the dedupe estimation.
The default is all, meaning both dedupe (with zeros) and compression are calculated.
Dedupe, compression, or zero handling can be disabled with:
no-dedupe | no-compress | no-dedupe-zeros
For example, to start the tool without compression: ./zrest -p /mnt/share -c 8k -e no-compress
--freq or -r : report frequency in percent; the default is 1.
For example, to run the tool with report frequency 5%: ./zrest -p /mnt/share -c 8k -r 5
--threads or -t : number of threads. Default=2*number_of_cpus.
For example, to run the tool with 10 threads: ./zrest -p /mnt/share -c 8k -t 10
--scan_time or -T : scan time limit in seconds. Default=3600. Set 0 for no limit.
The Scanning phase stops when this limit is reached, and the Analyze phase then starts on the data that was scanned. Increasing the scanning and analyzing time increases accuracy.
For example, to limit scanning to 2 hours: ./zrest -p /mnt/share -c 8k -T 7200
--out or -o : output file path; the output will be saved to the chosen file.
For example, to save the tool output to /tmp/out.txt: ./zrest -p /mnt/share -c 8k -o /tmp/out.txt
--memory or -M : limits the memory used by the tool. Default=80% of the available physical memory. Set 0 for no limit.
For example, to limit memory usage to 5GiB: ./zrest -p /mnt/share -c 8k -M 5G
--help or -h : shows all the options and parameters.
--version or -V : shows the version of the tool.
The tool's size parameters accept the suffixes K (KiB), M (MiB), G (GiB), and T (TiB).
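The individual flags above can be combined in a single run. The sketch below builds the full command as a string first, which makes it easy to review or log before launching; every flag is documented above, but the specific paths and values are illustrative:

```shell
# Dedupe-only estimate (no compression) on an 8 KiB-chunk volume, with
# 8 threads, a 2-hour scan limit, a 4 GiB memory cap, a report every 5%,
# and output saved to a file. All values are examples.
ZREST_CMD="./zrest -p /mnt/share -c 8k -e no-compress -r 5 -t 8 -T 7200 -M 4G -o /tmp/zrest-out.txt"
echo "$ZREST_CMD"

# When the command looks right, run it:
# $ZREST_CMD
```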
Running the tool
Please follow the steps below to use the tool:
- Expose/mount the required volume to the server.
- Download the tool to your server and extract it:
- tar zxvf zrest-linux-1.08.tgz
- Run the tool on the required volume.
Example (Linux):
/home/zadara/zrest -p /mnt/smb_share/
Example (Windows):
c:\zrest\zrest.exe -p z:\
- You can stop the tool once you see it converging to a certain dedupe ratio. Typically this happens around 25% to 30% of the run.
Understanding the Output
The output of the tool is shown in 3 sections:
- Input Parameters - prints the parameters that were provided by the user
- Scanning information:
- First row prints the scanned paths, an estimate of the total amount of data and total number of files, and the average file size.
- Second row prints the scan time, the total amount of scanned data, the scanning rate, the total number of scanned files, the reducible* amount of data (which is used during the Analyzing phase), and the memory usage during the scanning phase.
*The reducible amount of data ignores files smaller than the chunk size.
- Analysis information:
- The analyzing time in seconds
- The progress in GiB and percentage
- The analyzing Bandwidth in MiB/s
- The amount of used Memory
- The Dedup Ratio (zeros are not considered if the no-dedupe-zeros option is set)
- The Low/High Dedup Ratio. It takes some time to display an accurate dedup ratio; until then, the tool presents a low-to-high dedup range together with a theoretical DEDUPE value.
- The Compression ratio
- The % of Zeros in the analyzing data
Support and questions
If you encounter any difficulties running the tool or have additional questions, please contact support@zadarastorage.com.