2006-07-14
Much of the data that computers work with contains a lot of redundancy. In other words, the data could be represented by a smaller number of bytes with no loss of information. We see this in action almost every day when we use GIF or PNG images or when we ZIP files for archiving or emailing. Disk space and bandwidth savings can be considerable.
The bandwidth factor is particularly important as application architecture continues to move toward the distributed model. When large amounts of data have to travel across a network, whether locally or globally, data compression can reduce network load and improve application responsiveness. Version 2.0 of the .NET Framework provides classes that perform the required tasks of compressing and decompressing data.
System.IO.Compression
This namespace contains the two classes that are used for data compression and decompression, DeflateStream and GZipStream. It is important to understand the relationship between these two classes.
DeflateStream encapsulates the Deflate algorithm, an industry standard for lossless compression and decompression of data. DeflateStream provides "pure" data compression; you put in data at one end and the compressed data comes out the other end (or vice versa).
GZipStream is a wrapper for DeflateStream that adds support for the gzip file format. This format adds a header and footer to the compressed data. The header contains metadata about the compressed data, and the footer contains a cyclic redundancy check (CRC) value that is used to verify data integrity during decompression.
The GZipStream class is extensible to other data file formats. In practice, the choice between the two classes comes down to who will read the data. If you want to compress data that only your own program will read, whether it is sent over a network or saved to a file, you can use DeflateStream. If you want to compress data and save it to a file that is readable by other gzip-compatible programs, or to read gzip files created by other applications, use GZipStream.
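To make the difference between the two formats concrete, here is a minimal sketch that compresses the same bytes with both classes. The module and function names (FormatComparison, DeflateBytes, GZipBytes) are made up for illustration and are not part of the framework. The gzip output should be slightly larger than the raw Deflate output because of the added header and footer, and it should begin with the gzip signature bytes &H1F and &H8B.

```vbnet
Imports System.IO
Imports System.IO.Compression

' Illustrative helper module; the names are assumptions, not framework APIs.
Module FormatComparison
    ' Compresses data with DeflateStream and returns the raw Deflate bytes.
    Function DeflateBytes(ByVal data As Byte()) As Byte()
        Using target As New MemoryStream()
            ' True = leave the MemoryStream open so it can be read afterward.
            Using ds As New DeflateStream(target, CompressionMode.Compress, True)
                ds.Write(data, 0, data.Length)
            End Using ' Closing ds writes out any remaining compressed data.
            Return target.ToArray()
        End Using
    End Function

    ' Compresses data with GZipStream: the same algorithm plus a gzip header and footer.
    Function GZipBytes(ByVal data As Byte()) As Byte()
        Using target As New MemoryStream()
            Using gz As New GZipStream(target, CompressionMode.Compress, True)
                gz.Write(data, 0, data.Length)
            End Using
            Return target.ToArray()
        End Using
    End Function
End Module
```

Calling both functions on the same byte array and comparing the Length of each result shows the overhead of the gzip wrapper. For data that never leaves your own program, that overhead buys you nothing except the integrity check.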
Using DeflateStream
Both of the compression classes are designed to work with streams. This section shows you how to use DeflateStream to write compressed data to disk. GZipStream works exactly the same way; the only difference is the format of the resulting file. What you do in code is essentially to create the stream (a FileStream or MemoryStream, for example) as you usually would for reading or writing the data. Then, create a DeflateStream instance based on that stream and use the DeflateStream instance to read or write.
One way I use DeflateStream is when I need to serialize a large DataSet. The resulting XML file can be huge, and the compression algorithm works quite well on this kind of file. Assume that MyDataSet is the DataSet. Then, open the output file:
outputfile = New FileStream("mydata.xmd", FileMode.Create, FileAccess.Write)
Next, create a DeflateStream instance based on this FileStream:
ds = New DeflateStream(outputfile, CompressionMode.Compress, False)
Then, use the DataSet's WriteXml method to write the data to the stream:
MyDataSet.WriteXml(ds)
Finally, explicitly close both streams, making sure to close the DeflateStream first:
ds.Close()
outputfile.Close()
I have seen reports that explicitly closing the streams immediately after writing to them is necessary to flush them and prevent exceptions when later reading from the file. I have not encountered this error; it may have been a bug that Microsoft has since fixed. In any case, closing the streams as soon as you are done with them is a good idea, bug or not: DeflateStream buffers data internally, and closing it ensures that all of the compressed data reaches the file.
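Note that the DeflateStream constructor takes three arguments. The first is the underlying stream and the second is either CompressionMode.Compress or CompressionMode.Decompress, depending on the action you require. The third argument, leaveOpen, is an optional Boolean value: True leaves the underlying stream open when the DeflateStream instance is closed, while False (the default) closes the underlying stream as well. Because False is passed here, closing ds also closes outputfile, so the explicit outputfile.Close() call is harmless but redundant.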
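The writing steps above can be gathered into one routine. The following is a minimal sketch rather than a definitive implementation: the module name CompressedDataSetWriter and the SaveCompressed helper are assumptions made for illustration. It uses Using blocks so that both streams are closed in the right order even if WriteXml throws an exception.

```vbnet
Imports System.Data
Imports System.IO
Imports System.IO.Compression

' Illustrative helper module; the names are assumptions, not framework APIs.
Module CompressedDataSetWriter
    ' Serializes a DataSet to XML and writes it, Deflate-compressed, to fileName.
    Sub SaveCompressed(ByVal data As DataSet, ByVal fileName As String)
        Using outputfile As New FileStream(fileName, FileMode.Create, FileAccess.Write)
            ' False = closing the DeflateStream also closes the FileStream.
            Using ds As New DeflateStream(outputfile, CompressionMode.Compress, False)
                data.WriteXml(ds)
            End Using ' The DeflateStream is closed first, writing out the remaining data.
        End Using ' Disposing the already closed FileStream again is harmless.
    End Sub
End Module
```

With this in place, the code in the article reduces to a single call such as SaveCompressed(MyDataSet, "mydata.xmd").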
Reading compressed data is pretty much the mirror image of the writing process:
inputfile = New FileStream("mydata.xmd", FileMode.Open, FileAccess.Read)
ds = New DeflateStream(inputfile, CompressionMode.Decompress, False)
MyDataSet.ReadXml(ds)
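As with writing, the reading steps can be wrapped in a small routine. This is a sketch under the same assumptions as the rest of the article: the file was written with DeflateStream, and the module name CompressedDataSetReader and the LoadCompressed helper are hypothetical names chosen for illustration.

```vbnet
Imports System.Data
Imports System.IO
Imports System.IO.Compression

' Illustrative helper module; the names are assumptions, not framework APIs.
Module CompressedDataSetReader
    ' Reads a Deflate-compressed XML file back into a new DataSet.
    Function LoadCompressed(ByVal fileName As String) As DataSet
        Dim result As New DataSet()
        Using inputfile As New FileStream(fileName, FileMode.Open, FileAccess.Read)
            ' ReadXml pulls decompressed bytes through the DeflateStream as it parses.
            Using ds As New DeflateStream(inputfile, CompressionMode.Decompress, False)
                result.ReadXml(ds)
            End Using
        End Using
        Return result
    End Function
End Module
```

A call such as MyDataSet = LoadCompressed("mydata.xmd") then restores the data. Note that a file written with GZipStream must be read with GZipStream, and likewise for DeflateStream; the two formats are not interchangeable.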
What kind of advantages can you expect from compressing serialized data in this manner? Conventional wisdom has it that small text files (and XML is text, of course) compress poorly, and large files compress well. Tests do not bear this out, however. Here are the results of some tests that I ran, compressing XML files of different sizes created by serializing DataSets of different sizes:
1.4 kB compressed to 306 bytes (ratio = 4.6)
111 kB compressed to 4.5 kB (ratio = 24.5)
1.37 MB compressed to 149 kB (ratio = 9.2)
The smallest file did not compress as much as the larger files, but it's still a respectable decrease in size. Surprisingly, the middle-size file compressed better than the largest file did.
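If you want to run this kind of test on your own files, a sketch along the following lines may help. It compresses a byte array in memory and reports the ratio of original size to compressed size, the same ratio quoted in the results above. The module and function names (CompressionRatio, CompressedSize, Ratio) are assumptions made for illustration.

```vbnet
Imports System.IO
Imports System.IO.Compression

' Illustrative helper module; the names are assumptions, not framework APIs.
Module CompressionRatio
    ' Returns the number of bytes that data occupies after Deflate compression.
    Function CompressedSize(ByVal data As Byte()) As Long
        Using target As New MemoryStream()
            Using ds As New DeflateStream(target, CompressionMode.Compress, True)
                ds.Write(data, 0, data.Length)
            End Using ' Close the DeflateStream before measuring so nothing is left buffered.
            Return target.Length
        End Using
    End Function

    ' Returns original size divided by compressed size; larger means better compression.
    Function Ratio(ByVal data As Byte()) As Double
        Return data.Length / CDbl(CompressedSize(data))
    End Function
End Module
```

For example, Ratio(File.ReadAllBytes("mydata.xml")) gives the ratio for an uncompressed XML file, where the file name is only a placeholder.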
Summary
With the .NET 2.0 compression classes, you are limited to compressing 4 GB of data; that's probably not a major limitation for most of us! Also, the compression ratios you will achieve are, by Microsoft's own admission, not as good as can be obtained with other algorithms. But the best algorithms are protected by patents, while the one used by these .NET classes is not, which is a major advantage.
Incorporating data compression in your applications is not a no-brainer. The compression and decompression processes take time and resources, and the benefits you receive must more than counterbalance these costs. Different kinds of data compress differently, and even the size of the data comes into play. Text, such as XML data, can compress very well for large files but with less efficiency for small files. I think it's wise to perform tests in a real-world scenario — in other words, with the kind and size of data that your program will actually use — to determine what benefit, if any, you'll gain from data compression.
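To weigh the time cost against the space savings, you can time the compression step with the Stopwatch class, which is also new in version 2.0 of the framework. The following is a rough sketch only; the module name CompressionTiming and the TimeCompression helper are hypothetical, and a single run on a small buffer will not give a stable measurement.

```vbnet
Imports System.Diagnostics
Imports System.IO
Imports System.IO.Compression

' Illustrative helper module; the names are assumptions, not framework APIs.
Module CompressionTiming
    ' Returns the milliseconds spent compressing data into memory with DeflateStream.
    Function TimeCompression(ByVal data As Byte()) As Long
        Dim watch As Stopwatch = Stopwatch.StartNew()
        Using target As New MemoryStream()
            Using ds As New DeflateStream(target, CompressionMode.Compress, True)
                ds.Write(data, 0, data.Length)
            End Using ' Closing is part of the measured work, since it writes the last data.
        End Using
        watch.Stop()
        Return watch.ElapsedMilliseconds
    End Function
End Module
```

Run it several times on data of the kind and size your program actually handles, and compare the elapsed time with the time you save by writing or transmitting fewer bytes.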