Kalanand's February 2014 Log

   January 2014   
February 2014
SuMoTuWeThFrSa
      1
2345678
9101112131415
16171819202122
232425262728 
   March 2014   

February 3rd

What does the term 'scalable data' mean ?

It means that if you have N data items then you should do no more than

In the past:       Nm operations (polynomial time algo)
Now:               Nm/k operations for some large k
Soon:               N log(N) operations (especially for streaming data, e.g., LSST experiment will collect 30 TB/night)

MySQL refresher

SQL is a declarative language. A declarative language is non-procedural and very high-level. This means the programmer specifies what needs to be done rather than how to do it. The software will seek an answer to the question by interrogating a database containing Facts and Rules. It does not matter what order the facts and rules are arranged within the database - unlike procedural languages - the computer will find the best path towards the answer. There will either be a matching answer or a 'False' is returned where there was no answer to be found.

Below is my log from running MySQL in an interactive session.
# mysql history file: $HOME/.mysql_history
# During interactive session I can go to previous command using up/down arrow
To check if mysql is installed
mysql --help
mysql --version
Run some simple tests to verify that you can retrieve information from the server.
mysqlshow
mysqlshow mysql
mysql -e "SELECT Host,Db,User FROM db" mysql
To start a mysql interactive session
mysql 
# or
mysql -h host -u user -p 
To exit mysql, type ctr^d or "Q" or "QUIT".

To run a shell command
\! pwd
Some very basic commands
SELECT VERSION(), CURRENT_DATE;
SELECT VERSION(), NOW();
SELECT USER(), NOW();
SELECT SIN(PI()/4), (4+1)*5;
Let's try to write query results into a text file
SELECT NOW(), USER(), SIN(PI()/4), (4+1)*5 INTO OUTFILE './test1.txt';
# For some reason this file gets written as: /usr/local/mysql/data/test1.txt
To write a CSV file instead
SELECT NOW(), USER(), SIN(PI()/4), (4+1)*5 INTO OUTFILE './test1.csv' 
FIELDS TERMINATED BY ','  ENCLOSED BY '"' LINES TERMINATED BY '\n';
To see the available databases in memory
SHOW DATABASES;
Let's try to access a database named "test"
USE test;
Next, let's check how many tables I have in this database
SHOW TABLES;
Now let's see what information is there in a table called 'func'. If the table is not empty then we can run SELECT, JOIN, PROJECT, ... queries.
DESCRIBE func;
To run the query from a script instead of interactively:
source filename;

# Or,
\. filename
To run the above script from commandline (or in batch mode):
mysql db_name < script.sql  

Notes:

Some interesting mysql options
--html, -H                 # Produce HTML output. 
--silent, -s               # Silent mode. Produce less output.
--skip-column-names, -N    # Do not write column names in results. 
--table, -t                # Display output in table format. 
--verbose, -v              # Verbose mode. 
--xml, -X                  # Produce XML output. 
--raw	                   # Write column values without escape conversion
--delimiter=str            # Set statement delimiter. The default is ";”". 
--debug-info, -T 
--batch, -B                # Print results using tab as the column separator 

Data import from a text file
Here is an example

CREATE TABLE features (Store INT,Date DATE,Temperature DEC,Fuel_Price DEC,MarkDown1 DEC,MarkDown2 DEC,MarkDown3 DEC,MarkDown4 DEC,MarkDown5 DEC,CPI DEC,Unemployment DEC,IsHoliday VARCHAR(20));

LOAD DATA INFILE './features.csv' INTO TABLE features FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' IGNORE 1 LINES
(
Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
);
For simple files this can also be done from command line
mysqlimport db_name textfile1 [textfile2 ...]
For each text file named on the command line, mysqlimport strips any extension from the file name and uses the result to determine the name of the table into which to import the file's contents.

Practice test:
Practice test samples of SQL query are available at

http://sqlzoo.net/wiki/Main_Page
http://beginner-sql-tutorial.com
http://www.w3schools.com/sql/default.asp

February 5th

Examples of SELECT statement

Query to select page url by page-rank
SELECT pageURL, pageRank  FROM Rankings 
WHERE pageRank > 10
Query to find ad-revenue by IP prefix
SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM userVisits 
GROUP BY SUBSTR(sourceIP, 1, 7)
Note: The built-in function SUBSTR() or SUBSTRING() returns a substring. The first argument is the string. The second argument is the position of the first character to extract. The third argument is the number of characters to extract.

SELECT INTO statement
We can copy all columns into the new table using SELECT INTO statement:
SELECT *
INTO newtable [IN externaldb]
FROM table1;
Or we can copy only the columns we want into the new table:
SELECT column_name(s)
INTO newtable [IN externaldb]
FROM table1;
The new table will be created with the column-names and types as defined in the SELECT statement. You can apply new names using the AS clause.

MySQL query: union, difference, and join

Here is an example of union (R1 U R2):
SELECT * FROM R1
UNION  [ALL]
SELECT * FROM R2
And the result
 R1                R2             R1 U R2         R1 U R2 with ALL option 
-------          -------          -------         -----------------------
A   B            A    B            A    B              A    B
a1  b1           a1   b1           a1   b1             a1   b1
a2  b1           a3   b4           a2   b1             a2   b1
                                   a3   b4             a3   b4
                                                       a1   b1

Here is an example of difference (R1 - R2):

SELECT * FROM R1
EXCEPT
SELECT * FROM R2
And the result
R1 - R2
-------
a2  b1
An example of join (or inner join, R1 JOIN R2 on A=B):
SELECT * FROM R1, R2
WHERE R1.A = R2.A
or, alternatively
SELECT * FROM R1 JOIN R2
ON R1.A = R2.A
Both of the above are ways to spell the same query. The result is
R1.A   R1.B    R2.A    R2.B
a1      b1      a1      b1

Q: Write query to find all hospitals within 5 miles of a school.

SELECT DISTINCT h.name 
FROM hospitals h, Schools s 
WHERE distance (h.location, s.location) < 5
Q: Write query to find all user clicks made within 5 seconds of a page load.
SELECT * FROM Clicks c, PageLoad p
WHERE abs(c.click_time - p.load_time)< 5
Here is an example of outer join using the tables shown below
      P                               J
----------------------        ----------------------
age    zip    disease         job       age   zip    
----------------------        ----------------------
54    98125    heart          lawyer    54     98125
20    98120    flu            cashier   20     98120
33    98120    lung
SELECT * FROM P LEFT OUTER JOIN J 
ON P.age = J.age AND P.zip = J.zip
The result is
----------------------------
age    zip    disease   job 
----------------------------      
54    98125    heart   lawyer 
20    98120    flu     cashier
33    98120    lung    null
One can similarly perform right outer join which will keep all elements from the second table instead. An example of JOIN and SELECT INTO statements used in the same query
SELECT sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue 
INTO Temp 
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL 
      AND UV.visitDate BETWEEN '2000-01-05' AND '2000-01-22'
GROUP BY UV.sourceIP;


SELECT sourceIP, totalRevenue, avgPageRank FROM Temp
ORDER BY totalRevenue DESC
Limit 1;
An example of distributed query
CREATE VIEW Sales AS 
SELECT * FROM JanSales 
UNION ALL
SELECT * FROM FebSales 
UNION ALL 
SELECT * FROM MarSales;
 
CREATE TABLE MarSales (
 OrderID INT,
 CustomerID INT NOT NULL, 
 OrderDate DATETIME NULL CHECK( DATEPART(mm, OrderDate)=3),
 CONSTRAINT OrderIDMonth PRIMARY KEY(OrderID)
)
--- similarly, create tables for FebSales and JanSales

what is the difference between GROUP BY and ORDER BY in sql

ORDER BY alters the order in which items are returned.

GROUP BY will aggregate records by the specified columns which allows you to perform aggregation functions on non-grouped columns (such as SUM, COUNT, AVG, etc).

Creating index in MySQL query

How to create an index in SQL:
CREATE INDEX name_index
ON Employee (Employee_Name);
In practice, we first create index on a column of the table and then perform query on that column
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq FROM sequence 
WHERE seq = 'GATTACGATTA'
To create a multi-column index
CREATE INDEX name_index
ON Employee (Employee_Name, Employee_Age);
Note: SQL Logical Operations have 3 possible values: TRUE, FALSE, and UNKNOWN. In the majority of databases, a comparison to NULL returns UNKNOWN - this is true even when comparing NULL to NULL. The correct way to check for a NULL or a non-NULL column is to use the IS NULL or the IS NOT NULL syntax.

For example, let's consider the following example of an SQL table:


Computer {

                modelNumber CHAR(30) NOT NULL,
                laptopModel    CHAR(15),

         }
Assume that the table stores entries for all the makes of PCs and laptops - and if it's a laptop the laptopModel field is set. Given this information, write an SQL statement that returns only the PCs and no laptops from the table above. Remember that we cannot use the equality operator (the "=") to test for a NULL column value.
SELECT * FROM Computer WHERE laptopModel IS NULL;

Commonly used SQL aggregate functions

AVG() - Returns the average value
COUNT() - Returns the number of rows
FIRST() - Returns the first value
LAST() - Returns the last value
MAX() - Returns the largest value
MIN() - Returns the smallest value
SUM() - Returns the sum

Few useful cases to keep in mind:

1.) Suppose I want to write an SQL query that will return the maximum value from the "Numbers" column in the table below, without using a SQL aggregate like MAX or MIN.
  Compare
----------
| Numbers |
----------
| 40      |
----------
| 68      |
----------
| -20     |
----------
| 99      |
----------
Here is a neat way to do this
SELECT Numbers FROM Compare ORDER BY
      Numbers DESC  --order in descending order
LIMIT 1;  --retrieve only one value

2.) Suppose we want to find the 2nd highest salary in a table

SELECT MAX(Salary) FROM Employee
WHERE Salary NOT IN (SELECT MAX(Salary) FROM Employee );

3.) Now, suppose we want to find Nth highest salary

SELECT Salary FROM Employee ORDER BY Salary DESC LIMIT N-1,1;
where substitute 'N' by the appropriate numerical value.

Transactions in MySQL

MyISAM is the default storage engine in MySQL. If a table is created using MyISAM, then that table does not have transaction support. To enable transaction support in MySQL you must use either BDB (which stands for Berkeley DB) or InnoDB storage engine when creating a table. Here is an example:
CREATE TABLE EMPLOYEE (
  EMPLOYEE_ID   CHAR(5)  NOT NULL, 
  EMPLOYEE_NAME VARCHAR(50) NOT NULL,
  PRIMARY KEY (EMPLOYEE_ID) 
)
ENGINE = INNODB;
If a table is created with one of the storage engines that supports transactions - either BDB or InnoDB - then there are two modes of transaction possible:

Granting privileges/permissions in SQL

GRANT privilege_name  /* e.g., ALL, SELECT, EXECUTE, UPDATE, INSERT, CREATE */
ON object_name        /* e.g., TABLE, VIEW, SEQUENCE */
TO {user_name |PUBLIC |role_name} /* Use PUBLIC to grant access to all users */
[WITH GRANT OPTION]; /* allows a user to grant access rights to other users */
For example
GRANT SELECT ON SOME_TABLE TO SOME_USER;

SQL REVOKE Command

REVOKE privilege_name
ON object_name
FROM {user_name |PUBLIC |role_name}; 
For example
REVOKE SELECT ON employee FROM user1;

Some other SQL commands

Create user:
CREATE USER username
[IDENTIFIED BY password]
[other options];

Case statement:

CASE input_expression /* usually a column_name */
WHEN conditional_expression1 THEN result_expression1
[WHEN conditional_expression2 THEN result_expression2]
[ELSE result_expression]
END
For example
SELECT LANGUAGE_NAME, RATING_CODE AS RATING, 
CASE RATING_CODE
WHEN '3' THEN 'DIFFICULT'
WHEN '2' THEN 'FAIRLY DIFFICULT'
WHEN '1' THEN 'EASY'
ELSE 'UNKNOWN'
END AS RATING_DESCRIPTION
FROM PROGRAMMING_LANGUAGES;

RANK and DENSE_RANK statements:
Use of RANK()

SELECT Employee_ID, Salary, 
RANK() over (ORDER BY Salary DESC) AS Ranking
FROM Employee;
I get an output like
Employee_ID  Salary   Ranking
4            800      1
3            200      2
9            200      2
12           100      4
Use of DENSE_RANK()
SELECT Employee_ID, Salary, 
DENSE_RANK() over (ORDER BY Salary DESC) AS DenseRank
FROM Employee;
The result is
Employee_ID  Salary   DenseRank
4            800      1
3            200      2
9            200      2
12           100      3

RENAME a table:

RENAME TABLE {tbl_name} TO {new_tbl_name};
For example
CREATE TABLE employees
( id NUMBER(6),
name VARCHAR(20)
);
INSERT INTO employees( id, name ) values( 1, 'name 1');
INSERT INTO employees( id, name ) values( 2, 'name 2');
INSERT INTO employees( id, name ) values( 3, 'name 3');

SELECT * FROM employees;

RENAME TABLE employees TO employees_new;
SELECT * FROM employees_new; 

Create view: A view is a relation defined by a query. The syntax is

CREATE VIEW view_name AS
SELECT column_list
FROM table_name [WHERE condition]; 
For example, to create a view on the product table
CREATE VIEW view_product AS
SELECT product_id, product_name
FROM product; 
Another example. Let's assume we have two tables:
Purchase (customer, pid, store)
Product (pid, price)
Customer (cid, name, city)

We want to create a view called "StorePrice" to show store names and price
CREATE VIEW StorePrice AS 
SELECT x.store, y.store FROM Purchase x, Product y 
WHERE x.pid = y.pid;

How to use a view ? Just like a normal table. For example,

SELECT DISTINCT z.name, u.store 
FROM customer z, Purchase u, StorePrice v 
WHERE z.cid = u.customer AND u.store = v.store AND v.price > 1000;

SQL user-defined functions

As a user I can write a function, register it in the database, call it from SQL, and set permissions on it. Below are a few simple examples of user-defined functions (UDFs):
CREATE FUNCTION CubicVolume (@CubeLength decimal(4,1), @CubeWidth decimal(4,1),
    @CubeHeight decimal(4,1) )
RETURNS decimal(12,3)
AS
BEGIN
   RETURN ( @CubeLength * @CubeWidth * @CubeHeight )
END
The above function is a scalar UDF because it returns a single value of a predefined data type. This function can then be used like
SELECT CubicVolume(r.a, r.b, r.c) FROM Shape r 
WHERE CubicVolume(r.a, r.b, r.c) > 10;
SQL also supports user-defined functions that return a table data type. Here is an example to create a table with 100,000 rows with random values from 1 to 10 in the DataVal column.
CREATE TABLE LargeTable
  (KeyVal int NOT NULL PRIMARY KEY,
   DataVal int NOT NULL CHECK (DataVal BETWEEN 1 AND 10)
  );

WITH Digits
AS (SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(d))
INSERT INTO LargeTable (KeyVal, DataVal)
SELECT 10000 * a1.d + 1000 * a2.d + 100 * a3.d + 10 * a4.d + a5.d + 1,
       10 * RAND(CHECKSUM(NEWID())) + 1
FROM Digits AS a1,  Digits AS a2,  Digits AS a3, Digits AS a4, Digits AS a5;
We can use the above function as
SELECT DataVal FROM LargeTable( 2, 2);

February 7th

Vagrant, the virtual machine

Vagrant is a free and open-source software for creating and configuring virtual development environments. It can be considered a wrapper around virtualization software such as VirtualBox and configuration management software such as Chef, Salt and Puppet. See Wikipedia for details.

Instructions to download and use Vagrant can be found here http://www.vagrantup.com/

To get started, the most used commands are
vagrant up                                   (start your virtual machine)

vagrant suspend                              (put the virtual machine to sleep, it takes 
                                              ~ a minute to complete 'vagrant up' afterward)

vagrant halt                                 (shuts down the virtual machine, it takes 
                                              ~ a minute to complete 'vagrant up' afterward)

vagrant ssh                                  (login to the virtual machine)

vagrant destroy                              (destroys the virtual machine, it takes 
                                              ~ 20 minutes to complete 'vagrant up' again)

February 16th

Recipe to compute factorization/renormalization scale uncertainties

For the umpteenth time I had to write a recipe to compute scale uncertainties associated with the use of leading order W+jets Madgraph simulation. Here is the email I sent to Phil Dudero today.
This is a three step process.

Step 1: Produce W+jets events in MCFM using
close-to-analysis-level cuts. Configuration can
be found here
https://github.com/kalanand/UserCode/blob/master/WJets/inputWjets_LnuJJ.DAT
Changes needed:

'6.0'        [file version number]
'tota'       [part 'lord','real' or 'virt','tota']
7000d0       [sqrts in GeV]
  

Please change these to actual version number, 'lord' and
8000d0, respectively. Please also adjust the jet, lepton,
and MET cuts to match the analysis requirements.
Documentation: http://mcfm.fnal.gov/mcfm.pdf
----- Preparation if you have never run MCFM ----- 

- Please checkout MCFM from the above link or from here:
https://twiki.cern.ch/twiki/bin/viewauth/CMS/MCFM
Alternatively:
Run from Yurii's directory (~maravin/nobackup/MCFM/Bin, 
so  you do not need to install yourself.

---------------------------------------------------------

Step 2:  Please change the factorization/renormalization
scale in the following lines

1d0        [scale:QCD scale choice]

1d0        [facscale:QCD fac_scale choice]

Change both to 12d0 for factorization/renormalization
scale-down and to 15d0 for scale-up.


Step 3: Plot the ratio histograms: scale-up/default and
scale-down/default in hadronic W pt. Now apply these
ratios to your Madgraph simulated W+jets events. You will
get the uncertainty envelop similar to the one I had
last year. 
Turns out that in the more recent versions of MCFM, 12d0 and 15d0 are incorrect codes for scale up/down, we should be using -3d0 and -6d0. Phil was able to debug this, here is his full message
I decided to look myself at the MCFM code, and I find these snippets in the src/Need/reader_input.f:

c--- set up the default choices of static scale, if required
      if (scale .lt. 0d0) then
        if     (scale .eq. -2d0) then
          factor=0.25d0
        elseif (scale .eq. -3d0) then
          factor=0.5d0
        elseif (scale .eq. -4d0) then
          factor=0.75d0
        elseif (scale .eq. -5d0) then
          factor=1d0
        elseif (scale .eq. -6d0) then
          factor=2d0
        elseif (scale .eq. -7d0) then
          factor=4d0
        else
          factor=1d0
        endif

and

      if (facscale .lt. 0d0) then
        if     (facscale .eq. -2d0) then
          factor=0.25d0
        elseif (facscale .eq. -3d0) then
          factor=0.5d0
        elseif (facscale .eq. -4d0) then
          factor=0.75d0
        elseif (facscale .eq. -5d0) then
          factor=1d0
        elseif (facscale .eq. -6d0) then
          factor=2d0
        elseif (facscale .eq. -7d0) then
          factor=4d0
        else

So I think 12d0 and 15d0 are the incorrect codes, I should be using -3d0 and -6d0.

February 24th

Closed form expression of some commonly used quantities

Natural numbers:
For integer series

1 + 2 + 3 + ... + n = Σni=1 i = n (n+1) /2.

Infinite geometric series:
For r ∈ (-1, 1), the geometric series is given by

Σi=0 a * ri = a / (1 − r).

Fibonacci Number:
The Fibonacci numbers are the sequence of numbers {Fn}n=1 defined by the linear recurrence equation
Fn = F(n-1) + F(n-2) with F1 = F2 = 1.
It is conventional to define F0 = 0.

The Fibonacci numbers for n=1, 2, ... are 1, 1, 2, 3, 5, 8, 13, 21, ...
Fibonacci numbers can be viewed as a particular case of the Fibonacci polynomials Fn(x) with Fn = Fn(1).
The closed form for Fn is given by
Fn = (αn − βn)/(α − β), where α and β are the roots of x2 = Ax+B.
Here, A = B = 1, so the equation becomes x2 −x −1 = 0, which has roots   x = (1/2) * (1 ± √5).

The closed form is therefore given by
Fn = [(1 + √5)n − (1 − √5)n] / (2n √5)

Benford's law:
Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time.

A set of numbers is said to satisfy Benford's Law if the leading digit d occurs with probability

P(d) = log10 (1 + 1/d).

Go to January's log


Last modified: Wed Mar 26 11:51:06 CST 2014