Kalanand's September 2016 Log
September 6th
HyperLogLog: Counting unique things
Let's say I want to know the number of unique elements in a list, i.e., its cardinality N.
For example, the number of unique websites I visited today, or the number of unique
searches I performed, or the unique words I read in today's newspaper, etc.
If I want an exact answer, I need to store all unique elements (and their counts) in a map.
This map consumes O(N) memory, which is quite prohibitive for large N.
There is a class of algorithms that provide a good approximation of the number of unique elements
using just a constant, and small, amount of memory. The best such algorithm currently
known is called HyperLogLog, due to Philippe Flajolet (2007).
The algorithm's practical implementation is discussed in detail in this 2013 paper from Google.
How does it work?
HyperLogLog is a probabilistic data structure. The main trick behind this algorithm is very simple.
Suppose I have a string of bits in which '0' and '1' occur with equal probability.
What is then the probability that it starts with '0', with '00', with '000', ..., with k zeros?
It is 1/2, 1/4, 1/8, ..., 1/2^k. This means that if I have encountered a value whose binary
representation starts with k zeros, I have looked through approximately 2^k elements.
So, given a list of values evenly distributed between 0 and 2^k − 1, I can track the maximum
length of the leading run of zeros in their binary representations, and this gives me a
reasonable estimate of the cardinality.
This concept is similar to recording the longest run of heads in a series of coin flips and using
that to guess the number of times the coin was flipped.
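A minimal sketch of this single-estimator trick (using std::hash as a stand-in for a real mixing hash such as Murmur3, and assuming it produces a 64-bit value; the item list is made up for illustration):
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Count the number of leading zero bits in a 64-bit value.
int leading_zeros(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    while ((x & (1ULL << 63)) == 0) { ++n; x <<= 1; }
    return n;
}

int main() {
    std::vector<std::string> items = {"apple", "banana", "cherry", "apple", "date"};
    int max_run = 0;
    for (const auto& s : items) {
        // std::hash stands in here for a high-quality hash such as Murmur3.
        uint64_t h = std::hash<std::string>{}(s);
        max_run = std::max(max_run, leading_zeros(h));
    }
    // If the longest observed run of leading zeros is k, I have "seen" roughly 2^k
    // elements -- a very noisy estimate; HyperLogLog tames the variance by averaging
    // many such estimators (the registers discussed below).
    std::cout << "Rough cardinality estimate: " << std::pow(2.0, max_run) << std::endl;
    return 0;
}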
How precise is the estimation?
The standard error of HyperLogLog is 1.04/√m, where "m" is the number of registers used.
For instance, cardinalities of order 10^9 can be estimated with an accuracy of 2% using only
1.5 kilobytes of memory.
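As a sanity check on those numbers (assuming the common choice of m = 2048 six-bit registers): 1.04/√2048 ≈ 2.3%, and 2048 × 6 bits = 1536 bytes ≈ 1.5 kilobytes.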
Using 16384 (= 2^14) registers, the standard error is 0.81%. We use the "Murmur3" hash function
(https://en.wikipedia.org/wiki/MurmurHash), which yields a 128-bit hash value. If we use 14 bits
of the hash output to address our registers, we are left with 128 − 14 = 114 bits to store the
longest run of zeros. So there is no practical limit to the cardinality of the sets we can count.
Also, the error for very small cardinalities tends to be very small, as shown here:
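A rough sketch of the register addressing described above (shown with a 64-bit hash rather than the 128-bit Murmur3 mentioned earlier; the names and the 14-bit choice are just for illustration):
#include <cstdint>

constexpr int kIndexBits = 14;   // 2^14 = 16384 registers

struct RegisterUpdate {
    uint32_t index;   // which register to update
    uint8_t  rank;    // 1 + number of leading zeros in the remaining bits
};

// Split a hash into a register index (top 14 bits) and the "rank" of the rest.
RegisterUpdate split_hash(uint64_t hash) {
    uint32_t index = static_cast<uint32_t>(hash >> (64 - kIndexBits));
    uint64_t rest  = hash << kIndexBits;        // remaining 50 bits, left-aligned
    uint8_t  rank  = 1;
    while (rank <= 64 - kIndexBits && (rest & (1ULL << 63)) == 0) {
        ++rank;
        rest <<= 1;
    }
    // Each register keeps the maximum rank it has seen; the final HyperLogLog
    // estimate is a bias-corrected harmonic mean over all registers.
    return {index, rank};
}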
HyperLogLogs are strings!!!
Representing a HyperLogLog as a plain string avoids introducing a new type at the database/code level.
This makes the work easy to back-port, and the state data can be retrieved and restored easily. Besides,
general-purpose hashes (e.g., Murmur3) are fast and have a decently low collision rate for strings.
Hmmm ... Uniform distribution, really?!!!
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k − 1 is too
hard to achieve. The data we encounter are mostly not numbers, are almost never evenly distributed,
and can take arbitrary values. But using a good hash function we can assume that the output
bits are evenly distributed, and most hash functions have outputs between 0 and 2^k − 1
(Murmur3 gives values between 0 and 2^128 − 1). The following study details the degree of uniformity
of commonly used general-purpose hashing algorithms (Murmur3's distribution is pretty uniform):
Implementation
A good open source C++ implementation of HyperLogLog is here.
Example usage of the above implementation:
#include <vector>
#include <string>
#include <iostream>
#include <fstream>
#include <hyperloglog.h>

int main() {
    hll::HyperLogLog hll(4);                  // 4 register-index bits, i.e. 2^4 = 16 registers
    std::vector<std::string> somedata;
    // ... load data into somedata ...
    for (const auto& item : somedata) {
        hll.add(item.c_str(), item.size());   // add() takes a char pointer and a length
    }
    double cardinality = hll.estimate();
    std::cout << "Cardinality: " << cardinality << std::endl;
    std::ofstream ofs("path/to/dumpfile");
    hll.dump(ofs);                            // the state can be restored later with restore()
    return 0;
}
More documentation/references
- https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
- http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
- https://github.com/twitter/algebird/wiki/HyperLogLog
- http://oertl.github.io/hyperloglog-sketch-estimation-paper/paper.pdf
- https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- https://github.com/optimizely/hyperloglog
September 15th
Use tcprewrite to remap IP subnets in a PCAP
I can use tcprewrite to map IP addresses in one subnet to IP addresses in another subnet.
Each source and destination subnet is expressed in CIDR notation,
and needn't be the same size. The format is: <match_cidr>:<rewrite_cidr>,...
tcprewrite --infile=scans-test.pcap --outfile=scans-test-tcprewrite.pcap --pnat=172.16.0.0/24:192.168.0.0/24 --skipbroadcast
This would cause traffic in 172.16.0.X to be remapped to 192.168.0.X.
See here for details: http://tcpreplay.synfin.net/wiki/tcprewrite
September 16th
Periodicity of a time series of Δt values
Consider the following use-case scenario. I have a series of events occurring at roughly periodic intervals, say, every five to ten minutes.
But the data is contaminated, so it also contains some fake events which are not periodic in nature. Let's assume
I know the time-ordered Δt values.
Given this information,
I want to quantify whether the given series is periodic, on a scale from 0
("non-periodic") to 1 ("fully periodic").
Here is my quick and rough solution. I take the median of the Δt values and compute
what fraction of the events are within a small tolerance window of the median.
For now I assume this fraction is a good proxy for periodicity. I still need to
check it against known semi-periodic data to refine the methodology.
#include <algorithm>
#include <cmath>
#include <vector>

// Returns the fraction of intervals lying within tolerance*median of the median.
// Note: the input vector is sorted in place.
template<typename T>
float periodicity(std::vector<T>& intervals, const float tolerance = 0.5) {
    size_t size = intervals.size();
    if (size < 2) return 0.0;
    std::sort(intervals.begin(), intervals.end());
    float median = 0.0;
    if (size % 2 == 0) {
        median = (intervals[size/2 - 1] + intervals[size/2]) * 0.5;
    } else {
        median = intervals[size/2];
    }
    size_t in_bound = 0;
    std::for_each(intervals.begin(), intervals.end(), [&](const T& d) {
        if (std::fabs(d - median) <= tolerance * median) ++in_bound;
    });
    return static_cast<float>(in_bound) / size;
}
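A quick usage sketch (compiled together with the template above; the Δt values are made up):
#include <iostream>
#include <vector>

int main() {
    // Intervals of roughly 7 minutes (in seconds), contaminated by two outliers from fake events.
    std::vector<float> dt = {410, 425, 15, 430, 418, 2100, 422, 415, 428};
    // 7 of the 9 intervals fall within 50% of the median (422 s), so the score is ~0.78.
    std::cout << "Periodicity score: " << periodicity(dt) << std::endl;
    return 0;
}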
September 21st
After upgrading from OS X El Capitan (10.11.x) to macOS Sierra (10.12) today,
I needed to update Xcode and my Python packages.
Xcode and command line tools
- Get the latest Xcode from here.
- Once Xcode is installed, open a terminal, run xcode-select --install, and click the Install button to install the required command line developer tools.
- Install latest MacPorts version from https://www.macports.org/install.php.
- To confirm the installation is working as expected, try using port in a new terminal window:
port version
Version: 2.3.4
- Now follow migration procedure steps from here: https://trac.macports.org/wiki/Migration.
Upgrading all python packages with pip
To upgrade all my python packages in one go with pip,
here is the command I ran as root ("sudo su") from within
a python session.
import pip
from subprocess import call

for dist in pip.get_installed_distributions():
    call("pip install --upgrade " + dist.project_name, shell=True)
Then change the permissions in "/usr/local/lib/python2.7/site-packages".
Go to August's log
Last modified: Wed Sep 21 16:23:24 PDT 2016