Working with Indexers and Iterators

Overview

Once you have opened a capture file and acquired a reference to a FileCapture or one of its subclasses, you work with indexers and iterators to actually acquire packets, records or buffers. You can also use these indexers and iterators to modify the capture file by removing, adding, replacing, resizing or swapping packets, records or plain raw byte buffers.

You can choose which type of element you want to work with. As mentioned above, there are 3 kinds:

  • FilePacket - a specialized subclass of the normal Packet which adds file-specific functionality. You also get full packet decoding using the codecs found in the jnetstream.packet package.
  • Records - elements from the jnetstream.capture.file package that allow you to work with the structure of the capture file. You can access and modify any record within the file. There are specific subclasses for each file format supplied with the core jNetStream capture package. They provide convenient accessor and setter methods to decode and modify record headers.
  • Raw - raw byte buffers offer no decoding of record headers; they simply offer the raw data and pure performance.

Working with iterators

Iterators keep track of the current position, commonly referred to as a "cursor". The cursor is updated after each iteration and operation. The iterator always keeps the cursor pointing at the start of a record, either somewhere within the capture file or in the editor's memory if you have begun to modify the file. Iterators are very similar to java.util.Iterator and offer the basic hasNext() and next() calls, but you can think of jNetStream iterators as super iterators that offer much more.

Each iterator implements the comprehensive FileModifier interface, which has all the methods you would typically find in the java.util.List interface for adding, removing and retaining elements. There is no get() method; instead you call next() to retrieve the record at the current cursor position. You can also use the setPosition(long) method, which allows you to jump anywhere in the file if you have a specific address in mind. A much better way is to use one of the numerous seek methods available. You can seek by position, by the capture timestamp found on a record, or by percentage of the file. There are also convenient seekFirst(), seekLast() and seekEnd() methods which allow you to efficiently jump to those locations.
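To make the cursor idea concrete, here is a minimal, self-contained sketch. It is a toy model and not the jNetStream implementation: the class CursorSketch is hypothetical, records are reduced to their capture timestamps, and the exact meaning given to seekLast() (cursor on the last record) and seekEnd() (cursor past the last record) is an assumption made for illustration.

```java
import java.util.NoSuchElementException;

/** Toy model of a cursor-based iterator; NOT the jNetStream implementation. */
class CursorSketch {
    private final long[] timestamps; // hypothetical capture timestamps, ascending
    private int cursor = 0;          // index of the record that next() will return

    CursorSketch(long[] timestamps) {
        this.timestamps = timestamps.clone();
    }

    boolean hasNext() {
        return cursor < timestamps.length;
    }

    /** Returns the timestamp of the record at the cursor, then advances the cursor. */
    long next() {
        if (!hasNext()) {
            throw new NoSuchElementException("cursor is at the end");
        }
        return timestamps[cursor++];
    }

    /** Places the cursor on the first record. */
    void seekFirst() {
        cursor = 0;
    }

    /** Assumed semantics: places the cursor on the last record, so next() returns it. */
    void seekLast() {
        cursor = Math.max(0, timestamps.length - 1);
    }

    /** Assumed semantics: places the cursor past the last record. */
    void seekEnd() {
        cursor = timestamps.length;
    }

    /** Places the cursor on the first record captured at or after the given time. */
    void seek(long timestamp) {
        cursor = 0;
        while (cursor < timestamps.length && timestamps[cursor] < timestamp) {
            cursor++;
        }
    }

    public static void main(String[] args) {
        CursorSketch it = new CursorSketch(new long[] {100, 200, 300, 400});
        it.seek(250);                     // cursor is now on the record stamped 300
        System.out.println(it.next());    // 300
        it.seekLast();
        System.out.println(it.next());    // 400
        System.out.println(it.hasNext()); // false
    }
}
```

The point of the sketch is that there is no get(index): every read goes through the cursor, and the seek methods only move the cursor.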

As you might guess there are 3 types of iterators:

  • PacketIterator - gets and sets packet objects
  • RecordIterator - gets and sets record objects
  • RawIterator - gets and sets raw byte buffers

All 3 provide all of the above methods plus a few specific methods geared toward the element type they deal with.

One fact which may not be obvious is the subtle difference between the iterators acquired from the FileCapture object.

Take PcapFile.getRecordIterator() and PcapFile.getPacketIterator(). It would seem that there would be a 1 to 1 relationship between a record and a packet.

There is a small subtlety here. There would be a 1 to 1 relationship between packets and records if the file format was such. For the most part the Pcap and Snoop file formats come very close to this. jNetStream keeps track of all the records, including the file header, which in jNetStream terms is called a Block Record. In other words, Pcap and Snoop files have 1 extra record at the beginning of the file. So the first record returned from the record iterator is a block record, not a packet record. Other file formats which use so-called "channels" or simple meta records (records which do not contain packet data) will also not maintain this 1 to 1 relationship. jNetStream clearly differentiates the two and the user should be aware of this.

The same goes for RawIterator. Also keep in mind that although the Pcap and Snoop file formats only have a single block record, other formats may have more. As a matter of fact, the FileCapture interface provides a BlockIterator which iterates only over block records.

(Note: as a convenience, all iterators provide seekFirst() and seekSecond() methods. The second is there because skipping over the block record is such a common thing to do when dealing with a RecordIterator.)

So our last iterator is a record iterator, but one that only iterates over block records. Its signature looks like this: RecordIterator<? extends BlockRecord>. Pcap and Snoop files only have a single block record, but other formats may contain more than 1. For convenience the PcapFile and SnoopFile subinterfaces provide a getBlockRecord() method to give you the single block record that is always present. The NapFile subinterface does not provide such a method, as there is usually more than 1 block record.
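The difference between the record view and the packet view can be sketched with a toy model. The class RecordVsPacketSketch and its Kind enum are hypothetical stand-ins, not jNetStream types; the only thing taken from the text above is the Pcap-like layout of one block record followed by packet records.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of record, packet and block views; NOT the jNetStream API. */
class RecordVsPacketSketch {
    enum Kind { BLOCK, PACKET }

    /** Builds a Pcap-like layout: one block record followed by packet records. */
    static List<Kind> pcapLikeFile(int packetCount) {
        List<Kind> records = new ArrayList<>();
        records.add(Kind.BLOCK);
        for (int i = 0; i < packetCount; i++) {
            records.add(Kind.PACKET);
        }
        return records;
    }

    /** Counts the records of one kind, as a packet-only or block-only view would. */
    static int count(List<Kind> records, Kind wanted) {
        int n = 0;
        for (Kind kind : records) {
            if (kind == wanted) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<Kind> file = pcapLikeFile(3);
        System.out.println(file.size());               // 4 records in total
        System.out.println(count(file, Kind.PACKET));  // 3 packets
        System.out.println(count(file, Kind.BLOCK));   // 1 block record
        System.out.println(file.get(0));               // BLOCK: what the record view sees first
        System.out.println(file.get(1));               // PACKET: where a "seek second" would land
    }
}
```

With this layout the record view always reports one element more than the packet view, which is why skipping the first record is such a common operation.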

Note: you should review the examples section which shows a number of iterator based examples.

Working with indexers

Indexers differ from the iterators explained above in 1 main area. They do not keep track of a "cursor" position; instead they keep a map of record positions for each capture file. You pass packet or record indexes to all of the accessor methods.

Just as there are 3 types of iterators, there are 3 types of indexers, each dealing with a different type of element:

  • PacketIndexer - gets and sets packet objects
  • RecordIndexer - gets and sets record objects
  • RawIndexer - gets and sets raw byte buffers

Indexers are very similar to iterators but, as mentioned earlier, each method takes an extra parameter which is a packet or record index, depending on the type of indexer. Each indexer implements the IndexedFileModifier interface, which provides the same type of methods that the java.util.List interface does. The methods are identical to those of the FileModifier interface implemented by iterators, except that an additional index parameter is used.

When an indexer is created, it first needs to scan the current file and any edits made to it to determine the position of each record, which it keeps in an indexed array. When you pass a record index to one of the indexer's methods, the index is looked up to find the real address of the record, either in the file or in memory, and the operation is performed.

Indexing a very large file may take time, so care must be taken whether to invoke this capability or not. Once invoked, an index is created, and additional resources are allocated between the editor and the indexer to provide accurate record address to index mappings. So if you need to do lots of random access IO, indexers are much more efficient, as record indexes are known and records can easily be located. But if you mainly need to iterate over the file and only once in a while jump back or ahead, it's better to use one of the iterator's seek methods, which will find the desired record.

Indexers and their resources are kept referenced via a SoftReference. Only an actual PacketIndexer, RecordIndexer or RawIndexer holds a hard reference to the index data set and resources. So when you release the references to all indexers, all the back-end resources will eventually be garbage collected by the Java VM. It is therefore important not to keep unnecessary references to indexers, especially for large files. On the flip side, if you plan on using an indexer at a later time, the file is very large and you want to hold on to the index resources, you must keep a hard reference to the indexer somewhere, even when you are not using it for a while. It really is up to the user to decide.
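The soft versus hard reference trade-off can be sketched with the standard java.lang.ref.SoftReference class. SoftIndexSketch is a hypothetical toy, not jNetStream code: the "file" is just an array of record lengths, and the index is an array of record start positions that is rebuilt whenever the soft reference has been cleared.

```java
import java.lang.ref.SoftReference;

/** Toy model of a softly referenced index; NOT the jNetStream implementation. */
class SoftIndexSketch {
    private final int[] recordLengths;  // stand-in for the capture file contents
    private SoftReference<long[]> indexRef = new SoftReference<>(null);
    private int scans = 0;              // how many times the "file" had to be scanned

    SoftIndexSketch(int[] recordLengths) {
        this.recordLengths = recordLengths.clone();
    }

    /** Returns the record start positions, rescanning if the index was collected. */
    long[] acquireIndex() {
        long[] index = indexRef.get();
        if (index == null) {
            index = new long[recordLengths.length];
            long position = 0;
            for (int i = 0; i < recordLengths.length; i++) {
                index[i] = position;
                position += recordLengths[i];
            }
            scans++;
            indexRef = new SoftReference<>(index);
        }
        return index;
    }

    int scanCount() {
        return scans;
    }

    public static void main(String[] args) {
        SoftIndexSketch file = new SoftIndexSketch(new int[] {24, 100, 60, 1500});
        long[] hard = file.acquireIndex();     // the caller's hard reference keeps the index alive
        long[] again = file.acquireIndex();    // served from the soft reference, no rescan
        System.out.println(hard == again);     // true
        System.out.println(file.scanCount());  // 1
        System.out.println(hard[3]);           // 184
    }
}
```

As long as the caller keeps the returned array in a variable, the garbage collector cannot clear the soft reference. Once the caller drops it, the next acquireIndex() call may or may not have to rescan, depending on memory pressure.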

One last note about the index cache. All indexers share a common index cache, which may not be able to hold all the index data in memory at once. Various algorithms are used to cache this information. The exact amount of memory an index may take up is indeterminate. The base algorithm keeps indexes partitioned: up to 10,000 hard indexes are kept in memory, fewer of course if the file is smaller. Each of those hard indexes references a different portion of the physical file. If you use an index that falls in between these hard indexes, the index data may need to be either read from a file cache, or portions of the capture file reindexed to fill in the blanks, if you will.

Let's take a look at an extreme example. Let's say we have a file with 1 terabyte of data in it. There are 10,000,000,000 (10 billion) packets at an average of 100 bytes each in this file. Obviously this would be impossible to index completely in memory on any existing desktop, at least today. So what the indexer does is iterate over the entire file, mapping large chunks of it into memory (100 MB chunks) so it can scan the memory mapped buffers. It generates lots of 1,000,000 entry lists of soft indexes as it goes along. It only remembers the first index at each 1,000,000 index boundary using a hard reference. By the end of the file we have 10,000 hard indexes that reference the file every 1,000,000 records. Most of the soft references have been garbage collected as available memory became scarce, but we end up with a rough index every 1,000,000 packets. So if we ask for record 1,000,001, the indexer finds that the closest hard index it has is 1,000,000. It acquires the buffer with the physical file contents, scans that buffer and generates some more soft indexes. It can now easily look up the physical file position for index 1,000,001. Eventually the soft indexes may be GCed once again as we stop referencing them, and that memory will be reclaimed. Keep in mind that it currently takes about 2 seconds to scan about 6,000,000 packets for indexes. So even if we had to scan an entire 1,000,000 records, it would take only a few hundred milliseconds on this terabyte-sized file.
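The rough index idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual jNetStream algorithm: SparseIndexSketch is a hypothetical class, the "file" is an in-memory array of record lengths, and the soft index lists and memory mapped buffers are left out so that only the checkpoint-and-rescan logic remains.

```java
/** Toy model of a rough (checkpoint) index; NOT the jNetStream implementation. */
class SparseIndexSketch {
    private final int[] recordLengths;  // stand-in for record lengths read from the file
    private final int stride;           // number of records between two hard checkpoints
    private final long[] checkpoints;   // checkpoints[k] = start position of record k * stride

    SparseIndexSketch(int[] recordLengths, int stride) {
        this.recordLengths = recordLengths.clone();
        this.stride = stride;
        this.checkpoints = new long[(recordLengths.length + stride - 1) / stride];
        long position = 0;
        for (int i = 0; i < recordLengths.length; i++) {  // one full pass over the "file"
            if (i % stride == 0) {
                checkpoints[i / stride] = position;       // remember only every stride-th record
            }
            position += recordLengths[i];
        }
    }

    /** Finds a record's start position by rescanning from the nearest earlier checkpoint. */
    long positionOf(int recordIndex) {
        if (recordIndex < 0 || recordIndex >= recordLengths.length) {
            throw new IndexOutOfBoundsException("record index: " + recordIndex);
        }
        int checkpoint = recordIndex / stride;
        long position = checkpoints[checkpoint];
        for (int i = checkpoint * stride; i < recordIndex; i++) {
            position += recordLengths[i];                 // at most stride - 1 steps
        }
        return position;
    }

    int checkpointCount() {
        return checkpoints.length;
    }

    public static void main(String[] args) {
        int[] lengths = {10, 20, 30, 40, 50, 60, 70};
        SparseIndexSketch index = new SparseIndexSketch(lengths, 3);
        System.out.println(index.checkpointCount());          // 3
        System.out.println(index.positionOf(4));              // 100
        // Checkpoint count for the extreme example: 10 billion records, one per million.
        System.out.println(10_000_000_000L / 1_000_000L);     // 10000
    }
}
```

A lookup never scans more than stride - 1 records, so the stride trades memory for lookup time.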

There is another algorithm, which I opted not to enable by default, that caches all the indexes in a file. The problem with caching indexes in a file is that in our extreme example above the index cache file would be about 80 GB in size all by itself: 10 billion indexes at 8 bytes each. Not practical. The best overall solution is to reacquire the indexes as needed, using specialized algorithms which utilize a rough index table to quickly home in on the actual record position.
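The size comparison behind this decision is simple arithmetic, sketched here. IndexSizeSketch is a hypothetical helper, and the 8 bytes per entry assumes one 64-bit file position per index, as in the example above.

```java
/** Back-of-the-envelope index sizes for the extreme example. */
class IndexSizeSketch {
    static final long BYTES_PER_ENTRY = 8;  // one 64-bit file position per index entry

    /** Size of an index holding one entry for every record. */
    static long fullIndexBytes(long recordCount) {
        return recordCount * BYTES_PER_ENTRY;
    }

    /** Size of a rough index holding one entry per stride records. */
    static long roughIndexBytes(long recordCount, long stride) {
        long entries = (recordCount + stride - 1) / stride;  // round up
        return entries * BYTES_PER_ENTRY;
    }

    public static void main(String[] args) {
        long records = 10_000_000_000L;  // 10 billion packets
        System.out.println(fullIndexBytes(records));               // 80000000000 bytes, about 80 GB
        System.out.println(roughIndexBytes(records, 1_000_000L));  // 80000 bytes, about 80 KB
    }
}
```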

Common link between iterators and indexers

Iterators and indexers both work with a common editor which keeps track of any changes to the capture file or the capture session in general. Both indexers and iterators respond to changes reported by the editor in real time. Therefore, if you are working with both an iterator and an indexer, making changes with one changes the state of the other. So if you use an iterator to remove, let's say, 100 records from the capture session, the indexer will immediately report 100 fewer packets through its size() method. Also, indexes are relative to changes.

Say you have 100 records, which are indexed, and the packet at index 60 points to the packet record that was captured at exactly a certain time A (e.g. midnight today). If you remove 20 records before that particular record, using either an iterator or the indexer itself, the index of our packet record with timestamp A will shift from 60 to 40. So accessing index 60 will return some other, unexpected record: the one that was originally at index 80. The same goes for the byte address of the record. If you query it either from the packet itself with FilePacket.getPositionGlobal() or from the indexer with getPosition(40), you will notice that the record position has shifted as well, by the total size of the 20 records removed.
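The index shift described above behaves just like removal from a java.util.List, which makes it easy to reproduce. IndexShiftSketch is a hypothetical toy, not jNetStream code: each list element stores the record's original index, and a fixed record size of 100 bytes is assumed so that positions can be computed from indexes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of index and position shifts after a removal; NOT the jNetStream API. */
class IndexShiftSketch {
    static final int RECORD_LENGTH = 100;  // assumed fixed record size in bytes

    /** Builds 100 records; each value is the record's original index. */
    static List<Integer> hundredRecords() {
        List<Integer> records = new ArrayList<>();
        for (int id = 0; id < 100; id++) {
            records.add(id);
        }
        return records;
    }

    /** Removes the records at indexes from (inclusive) to to (exclusive). */
    static void removeRange(List<Integer> records, int from, int to) {
        records.subList(from, to).clear();
    }

    /** Byte position of the record currently at the given index. */
    static long positionOf(int index) {
        return (long) index * RECORD_LENGTH;
    }

    public static void main(String[] args) {
        List<Integer> records = hundredRecords();
        Integer recordA = records.get(60);  // the record captured at time A
        removeRange(records, 10, 30);       // remove 20 records located before it
        System.out.println(records.indexOf(recordA));              // 40
        System.out.println(records.get(60));                       // 80
        System.out.println(positionOf(records.indexOf(recordA)));  // 4000, down from 6000
    }
}
```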

Returned packets do not provide a way to acquire their index, but they do provide a getPositionGlobal() method which returns the packet's current and accurate address within the capture session. If the physical record for the packet you are holding a reference to is removed, the packet will be invalidated, a condition you can check for with FilePacket.isValid(). Otherwise the position of the packet may keep on changing as you edit the file. It's actually quite easy to map a position to an index with the indexer method mapPositionGlobalToIndexGlobal(globalPosition), which returns the index for the position in question. If the position does not align with the start of a record, -1 is returned.
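One way such a position to index mapping could work is a binary search over the sorted record start positions. This is only a sketch of the idea: PositionToIndexSketch and indexOfPosition are hypothetical names, and nothing here claims to show how mapPositionGlobalToIndexGlobal is actually implemented.

```java
import java.util.Arrays;

/** Toy model of mapping a byte position to a record index; NOT the jNetStream API. */
class PositionToIndexSketch {
    /**
     * Returns the index of the record that starts exactly at the given position,
     * or -1 if the position does not fall on the start of a record.
     */
    static long indexOfPosition(long[] sortedRecordStarts, long position) {
        int found = Arrays.binarySearch(sortedRecordStarts, position);
        return found >= 0 ? found : -1;  // binarySearch returns a negative value on a miss
    }

    public static void main(String[] args) {
        long[] starts = {0, 24, 124, 184};  // record start positions, ascending
        System.out.println(indexOfPosition(starts, 124));  // 2
        System.out.println(indexOfPosition(starts, 125));  // -1, inside a record
    }
}
```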