Kalanand's September 2016 Log
September 6th
HyperLogLog: Counting unique things
Let's say I want to know the number of unique elements in a list, i.e., its cardinality N.
For example, the number of unique websites I visited today, or the number of unique
searches I performed, or the unique words I read in today's newspaper, etc.
If I want an exact answer, I need to store all unique elements (and their counts) in a map.
This map consumes O(N) memory, which is quite prohibitive for large N.
There is a class of algorithms that provide a good approximation of the number of unique elements
using just a constant, and small, amount of memory. The best such algorithm currently
known is called HyperLogLog, due to Philippe Flajolet (2007).
The algorithm's practical implementation is discussed in detail in this 2013 paper from Google.
How does it work?
HyperLogLog is a probabilistic data structure. The main trick behind this algorithm is very simple.
Suppose I have a string of bits in which '0' and '1' occur with equal probability.
What is then the probability that it starts with '0', with '00', with '000', ..., with k zeros?
It is 1/2, 1/4, 1/8, ..., 1/2^k. This means that if I have encountered a value whose binary
representation starts with k zeros, I have looked through approximately 2^k elements.
So, given a list of values evenly distributed between 0 and 2^k − 1, I can track the maximum
length of the leading run of zeros in their binary representations, and this gives me a
reasonable estimate of the cardinality.
This concept is similar to recording the longest run of heads in a series of coin flips and using
that to guess the number of times the coin was flipped.
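A minimal sketch of this single-estimator trick (using std::hash as a stand-in for a real mixing hash such as Murmur3, and assuming it produces a 64-bit value; the item list is made up for illustration):
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Count the number of leading zero bits in a 64-bit value.
int leading_zeros(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    while ((x & (1ULL << 63)) == 0) { ++n; x <<= 1; }
    return n;
}

int main() {
    std::vector<std::string> items = {"apple", "banana", "cherry", "apple", "date"};
    int max_run = 0;
    for (const auto& s : items) {
        // std::hash stands in here for a high-quality hash such as Murmur3.
        uint64_t h = std::hash<std::string>{}(s);
        max_run = std::max(max_run, leading_zeros(h));
    }
    // If the longest observed run of leading zeros is k, I have "seen" roughly 2^k
    // elements -- a very noisy estimate; HyperLogLog tames the variance by averaging
    // many such estimators (the registers discussed below).
    std::cout << "Rough cardinality estimate: " << std::pow(2.0, max_run) << std::endl;
    return 0;
}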
How precise is the estimation?
The standard error of HyperLogLog is 1.04/√m, where "m" is the number of registers used.
For instance, cardinalities of order 10^9 can be estimated with an accuracy of 2% using only
1.5 kilobytes of memory.
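As a sanity check on those numbers (assuming the common choice of m = 2048 six-bit registers): 1.04/√2048 ≈ 2.3%, and 2048 × 6 bits = 1536 bytes ≈ 1.5 kilobytes.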
Using 16384 (= 2^14) registers, the standard error is 0.81%. We use the "Murmur3" hash function
(https://en.wikipedia.org/wiki/MurmurHash), which yields a 128-bit hash value. If we use 14 bits
of the hash output to address our registers, we are left with 128 − 14 = 114 bits to store the
longest run of zeros. So there is no practical limit to the cardinality of the sets we can count.
Also, the error for very small cardinalities tends to be very small, as shown here:
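A rough sketch of the register addressing described above (shown with a 64-bit hash rather than the 128-bit Murmur3 mentioned earlier; the names and the 14-bit choice are just for illustration):
#include <cstdint>

constexpr int kIndexBits = 14;   // 2^14 = 16384 registers

struct RegisterUpdate {
    uint32_t index;   // which register to update
    uint8_t  rank;    // 1 + number of leading zeros in the remaining bits
};

// Split a hash into a register index (top 14 bits) and the "rank" of the rest.
RegisterUpdate split_hash(uint64_t hash) {
    uint32_t index = static_cast<uint32_t>(hash >> (64 - kIndexBits));
    uint64_t rest  = hash << kIndexBits;        // remaining 50 bits, left-aligned
    uint8_t  rank  = 1;
    while (rank <= 64 - kIndexBits && (rest & (1ULL << 63)) == 0) {
        ++rank;
        rest <<= 1;
    }
    // Each register keeps the maximum rank it has seen; the final HyperLogLog
    // estimate is a bias-corrected harmonic mean over all registers.
    return {index, rank};
}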
HyperLogLogs are strings!!!
Representing a HyperLogLog as a plain string avoids introducing a new type at the database/code level.
This makes the work easy to back-port, and the state data can be retrieved and restored easily. Besides,
general-purpose hashes (e.g., Murmur3) are fast and have a decently low collision rate for strings.
Hmmm ... Uniform distribution, really?!!!
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k − 1 is too
hard to achieve. The data we encounter are mostly not numbers, are almost never evenly distributed,
and can take arbitrary values. But using a good hash function we can assume that the output
bits are evenly distributed, and most hash functions have outputs between 0 and 2^k − 1
(Murmur3 gives values between 0 and 2^128 − 1). The following study details the degree of uniformity
of commonly used general-purpose hashing algorithms (Murmur3's distribution is pretty uniform):
Implementation
A good open source C++ implementation of HyperLogLog is here.
Example usage of the above implementation:
#include <vector>
#include <string>
#include <iostream>
#include <fstream>
#include <hyperloglog.h>

int main() {
    hll::HyperLogLog hll(4);                  // 4 register-index bits, i.e. 2^4 = 16 registers
    std::vector<std::string> somedata;
    // ... load data into somedata ...
    for (const auto& item : somedata) {
        hll.add(item.c_str(), item.size());   // add() takes a char pointer and a length
    }
    double cardinality = hll.estimate();
    std::cout << "Cardinality: " << cardinality << std::endl;
    std::ofstream ofs("path/to/dumpfile");
    hll.dump(ofs);                            // the state can be restored later with restore()
    return 0;
}
More documentation/references
- https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
- http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
- https://github.com/twitter/algebird/wiki/HyperLogLog
- http://oertl.github.io/hyperloglog-sketch-estimation-paper/paper.pdf
- https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- https://github.com/optimizely/hyperloglog
September 15th
Use tcprewrite to remap IP subnets in a PCAP
I can use tcprewrite to map IP addresses in one subnet to IP addresses in another subnet.
Each source and destination subnet is expressed in CIDR notation,
and needn't be the same size. The format is: <match_cidr>:<rewrite_cidr>,...
tcprewrite --infile=scans-test.pcap --outfile=scans-test-tcprewrite.pcap --pnat=172.16.0.0/24:192.168.0.0/24 --skipbroadcast
This would cause traffic in 172.16.0.X to be remapped to 192.168.0.X.
See here for details: http://tcpreplay.synfin.net/wiki/tcprewrite
September 16th
Periodicity of a time series of Δt values
Consider the following use-case scenario. I have a series of events occurring at roughly periodic intervals, say, every five to ten minutes.
But the data is contaminated, so it also contains some fake events which are not periodic in nature. Let's assume
I know the time-ordered Δt values.
Given this information,
I want to quantify whether the given series is periodic, on a scale from 0
("non-periodic") to 1 ("fully periodic").
Here is my quick and rough solution. I take the median of the Δt values and compute
what fraction of the events are within a small tolerance window of the median.
For now I assume this fraction is a good proxy for periodicity. I still need to
check it against known semi-periodic data to refine the methodology.
#include <algorithm>
#include <cmath>
#include <vector>

// Returns the fraction of intervals lying within tolerance*median of the median.
// Note: the input vector is sorted in place.
template<typename T>
float periodicity(std::vector<T>& intervals, const float tolerance = 0.5) {
    size_t size = intervals.size();
    if (size < 2) return 0.0;
    std::sort(intervals.begin(), intervals.end());
    float median = 0.0;
    if (size % 2 == 0) {
        median = (intervals[size/2 - 1] + intervals[size/2]) * 0.5;
    } else {
        median = intervals[size/2];
    }
    size_t in_bound = 0;
    std::for_each(intervals.begin(), intervals.end(), [&](const T& d) {
        if (std::fabs(d - median) <= tolerance * median) ++in_bound;
    });
    return static_cast<float>(in_bound) / size;
}
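A quick usage sketch (compiled together with the template above; the Δt values are made up):
#include <iostream>
#include <vector>

int main() {
    // Intervals of roughly 7 minutes (in seconds), contaminated by two outliers from fake events.
    std::vector<float> dt = {410, 425, 15, 430, 418, 2100, 422, 415, 428};
    // 7 of the 9 intervals fall within 50% of the median (422 s), so the score is ~0.78.
    std::cout << "Periodicity score: " << periodicity(dt) << std::endl;
    return 0;
}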
September 21st
After upgrading from OS X El Capitan (10.11.x) to macOS Sierra (10.12) today,
I needed to update Xcode and my Python packages.
Xcode and command line tools
- Get the latest Xcode from here.
- Once Xcode is installed, open a terminal, run xcode-select --install, and click the Install button to install the required command line developer tools.
- Install latest MacPorts version from https://www.macports.org/install.php.
- To confirm the installation is working as expected, try using port in a new terminal window:
port version
Version: 2.3.4
- Now follow migration procedure steps from here: https://trac.macports.org/wiki/Migration.
Upgrading all python packages with pip
To upgrade all my python packages in one go with pip,
here is the command I ran as root ("sudo su") from within
a python session.
import pip
from subprocess import call

for dist in pip.get_installed_distributions():
    call("pip install --upgrade " + dist.project_name, shell=True)
Then change the permissions in "/usr/local/lib/python2.7/site-packages".
Go to August's log
Last modified: Wed Sep 21 16:23:24 PDT 2016