Kalanand's September 2016 Log


September 6th

HyperLogLog: Counting unique things

Let's say I want to know the number of unique elements in a list, i.e., its cardinality N. For example, the number of unique websites I visited today, or the number of unique searches I performed, or the unique words I read in today's newspaper, etc.

If I want an exact answer, I need to store all unique elements (and their counts) in a map. This map consumes O(N) memory, which is quite prohibitive for large N.
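
For contrast with the approximate approach below, here is a minimal sketch of exact counting with a hash set (the "element-..." strings are just made-up sample data):

// Exact counting for contrast: store every unique element in a hash set.
// Memory grows as O(N) with the number of distinct elements.
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> seen;
    for (int i = 0; i < 1000; ++i)
        seen.insert("element-" + std::to_string(i % 300)); // duplicates collapse
    std::cout << "exact cardinality: " << seen.size() << std::endl; // 300
    return 0;
}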

There is a class of algorithms that provide a good approximation of the number of unique elements using just a constant, and small, amount of memory. The best such algorithm currently known is called HyperLogLog, and is due to Philippe Flajolet (2007).

The algorithm's practical implementation is discussed in detail in a 2013 paper from Google, "HyperLogLog in Practice".

How does it work?

HyperLogLog is a probabilistic data structure. The main trick behind the algorithm is very simple. Suppose I have a random bit string in which each bit is 0 or 1 with equal probability. What is the probability that it starts with '0'? With '00'? With '000'? With k zeros? It is 1/2, 1/4, 1/8, ..., 1/2^k. This means that if I have encountered a string starting with k zeros, I have approximately looked through 2^k elements. So, given a list of elements whose values are evenly distributed between 0 and 2^k − 1, I can track the length of the longest prefix of zeros in their binary representations, and this gives me a reasonable estimate of the cardinality. The idea is similar to recording the longest run of heads in a series of coin flips and using it to guess how many times the coin was flipped.
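
To make the trick concrete, here is a minimal C++ sketch of this single-estimator idea (my own illustration, not a real HyperLogLog; std::hash stands in for Murmur3):

// Minimal sketch of the leading-zeros trick, NOT a full HyperLogLog.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// Number of leading zero bits in a 64-bit value.
int leading_zeros(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    while ((x & (1ULL << 63)) == 0) { x <<= 1; ++n; }
    return n;
}

int main() {
    std::hash<std::string> h;   // stand-in for a stronger hash like Murmur3
    const int N = 1000000;      // true cardinality of the stream
    int max_run = 0;
    for (int i = 0; i < N; ++i) {
        uint64_t v = h("element-" + std::to_string(i));
        max_run = std::max(max_run, leading_zeros(v));
    }
    // A single estimator like this has huge variance; HyperLogLog tames it
    // by keeping m registers and averaging over them.
    std::cout << "longest run of leading zeros: " << max_run
              << " => crude estimate ~ 2^" << max_run << " = "
              << std::pow(2.0, max_run) << " (true N = " << N << ")\n";
    return 0;
}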

How precise is the estimation?

The standard error of HyperLogLog is 1.04/√m, where m is the number of registers used. For instance, cardinalities of order 10^9 can be estimated with an accuracy of 2% while using a memory of only 1.5 kilobytes.

Using 16384 (= 2^14) registers, the standard error is 0.81%. We use the "Murmur3" hash function (https://en.wikipedia.org/wiki/MurmurHash), which yields a 128-bit hash value. If we use 14 bits of the hash output to address our registers, we are left with 128 − 14 = 114 bits in which to find the longest run of zeros. So there is no practical limit to the cardinality of the sets we can count. Also, the error for very small cardinalities tends to be very small.
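
As a quick sanity check of the 1.04/√m formula, this small snippet of mine tabulates the predicted relative error for a few register counts:

// Tabulate the predicted relative standard error 1.04/sqrt(m)
// for register counts m = 2^b.
#include <cmath>
#include <cstdio>

int main() {
    for (int b = 4; b <= 14; b += 2) {
        int m = 1 << b;
        std::printf("b = %2d  m = %5d  std error = %.2f%%\n",
                    b, m, 100.0 * 1.04 / std::sqrt(m));
    }
    return 0;   // b = 14 gives m = 16384 and ~0.81%
}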

HyperLogLogs are strings!!!

The fact that HyperLogLog state is stored as a plain string avoids introducing an actual new type at the database/code level. This allows the work to be back-ported, and the state data can be retrieved and restored easily. Besides, general-purpose hashes (e.g., Murmur3) are fast and have a decently low collision rate for strings.

Hmmm ... Uniform distribution, really?!!!

The problem is that the assumption of having evenly distributed numbers from 0 to 2^k − 1 is too hard to achieve. The data we encounter are mostly not numbers, are almost never evenly distributed, and can take any values. But using a good hashing function, we can assume that the output bits are evenly distributed, and most hashing functions have outputs between 0 and 2^k − 1 (Murmur3 gives values between 0 and 2^128 − 1). Studies of the degree of uniformity of commonly used general-purpose hashing algorithms show that the Murmur3 distribution is pretty uniform.
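
A quick way to convince yourself: feed deliberately skewed inputs (here, made-up URLs sharing a long common prefix) through a general-purpose hash and check that the outputs still spread evenly across buckets. Again std::hash stands in for Murmur3 in this sketch:

// Uniformity spot-check: skewed inputs with a long common prefix should
// still land roughly evenly in the buckets if the hash is good.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const int buckets = 16;
    std::vector<int> count(buckets, 0);
    std::hash<std::string> h;
    for (int i = 0; i < 160000; ++i) {
        uint64_t v = h("www.example.com/page?id=" + std::to_string(i));
        count[v % buckets]++;
    }
    for (int b = 0; b < buckets; ++b)
        std::cout << "bucket " << b << ": " << count[b] << "\n"; // ~10000 each
    return 0;
}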

Implementation

A good open source C++ implementation of HyperLogLog is https://github.com/optimizely/hyperloglog (reference 6 below).

Example usage of the above implementation

#include <vector>
#include <string>
#include <iostream>
#include <fstream>
#include <hyperloglog.h>

int main(){
    hll::HyperLogLog hll(4);            // 2^4 = 16 registers
    std::vector<std::string> somedata;

    // Load data into 'somedata' here (omitted), then add each item.
    for(auto & item : somedata){
        hll.add(item.c_str(), item.size());
    }

    double cardinality = hll.estimate();

    std::cout << "Cardinality: " << cardinality << std::endl;

    // Dump the state to a file; it can be reloaded later with restore().
    std::ofstream ofs("path/to/dumpfile");
    hll.dump(ofs);

    return 0;
}

More documentation/references

  1. https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
  2. http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
  3. https://github.com/twitter/algebird/wiki/HyperLogLog
  4. http://oertl.github.io/hyperloglog-sketch-estimation-paper/paper.pdf
  5. https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
  6. https://github.com/optimizely/hyperloglog

September 15th

Use tcprewrite to remap IP subnets in a PCAP

I can use tcprewrite to map IP addresses in one subnet to IP addresses in another subnet.

Each source and destination subnet is expressed in CIDR notation, and needn't be the same size. The format is: <match_cidr>:<rewrite_cidr>,...


tcprewrite --infile=scans-test.pcap --outfile=scans-test-tcprewrite.pcap --pnat=172.16.0.0/24:192.168.0.0/24 --skipbroadcast

This would cause traffic in 172.16.0.X to be remapped to 192.168.0.X. See http://tcpreplay.synfin.net/wiki/tcprewrite for details.

September 16th

Periodicity of a time series of Δt values

Consider the following use-case scenario. I have a series of events occurring at roughly periodic intervals, say, every five to ten minutes. But the data is contaminated, so it also contains some fake events which are not periodic in nature. Let's assume I know the time-ordered Δt values.

Given this information, I want to quantify whether the given series is periodic, on a scale from 0 ("non-periodic") to 1 ("fully periodic").

Here is my quick and rough solution. I take the median of the Δt values and compute the fraction of events that fall within a small tolerance window around the median. For now I assume this fraction is a good proxy for periodicity. I still need to check its coverage on known semi-periodic data to refine the methodology.


#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

template<typename T>
float periodicity (std::vector<T>& intervals, const float tolerance=0.5) {

   size_t size = intervals.size();
   if (size < 2) return 0.0;

   // Median of the sorted Δt values (note: sorts the caller's vector).
   float median = 0.0;
   size_t in_bound = 0;
   std::sort(intervals.begin(), intervals.end());
   if (size % 2 == 0) {
      median = (intervals[size/2 - 1] + intervals[size/2]) * 0.5;
   }
   else {
      median = intervals[size/2];
   }
   // Count the Δt values within ±tolerance×median of the median.
   std::for_each (intervals.begin(), intervals.end(), [&](const T d) {
         if (std::fabs(d - median) <= tolerance * median) in_bound++; });
   return static_cast<float>(in_bound) / size;
}
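
A quick usage example, assuming the template above is in scope. Eight of the ten Δt values (in seconds) cluster around 420 s, so the expected score is 0.8:

// Example: ten Δt values in seconds; eight are near 420 s (7 min),
// two are outliers from fake events.
#include <iostream>
#include <vector>

int main() {
    std::vector<float> dt = {420, 415, 430, 405, 425, 35, 410, 440, 1800, 420};
    std::cout << "periodicity = " << periodicity(dt) << std::endl;  // 0.8
    return 0;
}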

September 21st

After upgrading from OS X El Capitan (10.11.X) to macOS Sierra (10.12) today, I needed to update Xcode and my Python packages.

Xcode and command line tools

Upgrading all python packages with pip

To upgrade all my Python packages in one go with pip, here is the snippet I ran as root ("sudo su") from within a Python session.

import pip
from subprocess import call

# Loop over every installed distribution and upgrade it in place.
for dist in pip.get_installed_distributions():
    call("pip install --upgrade " + dist.project_name, shell=True)

Then change the permissions in "/usr/local/lib/python2.7/site-packages".

Go to August's log


Last modified: Wed Sep 21 16:23:24 PDT 2016