Thursday, March 29, 2018

Understanding Linked Lists with Go

Intro

I like to revisit the basics a lot. One of my favorites to revisit are the variable-length sequences known as linked lists. They are also used as the building blocks of some other important data structures like stacks and queues. Go is my favorite to demonstrate them with since the language doesn't get in the way.

Go already has variable-length sequences (slices) and hash tables (maps) built in, as those are the two most commonly used data structures that will solve most of your real-world problems. Indeed, slices generally provide a better alternative to linked lists in Go and still allow you to create stacks and queues. Although slices are simpler and quicker, knowing about linked lists is still a fundamental concept in programming.

You first need a basic understanding of pointers and methods in Go, so brush up on those if you need to and continue when you're ready.

The key advantage of linked lists is that, unlike arrays, you don't have to predefine how big they are. This is why slices are an alternative to linked lists since they can also grow arbitrarily in size, thanks to the append() built-in. Let's forget about slices for now and learn about linked lists.

The most basic linked list is a "chain of nodes" that only point forward, known as a "singly" linked list. Let's unpack what that means:

A linked list is a list of nodes that are linked together like a chain.
A Node is a composite type where one field holds a piece of data you care about and another field points to the next node in the list.
The last node in the list has its next field set to nil so that we know it's the end of the chain.

Pseudocode

In my opinion the best way to learn linked lists is to see the most basic form and in pseudocode before looking at real code.

A singly linked list at a minimum is just two structs and a method to append new nodes. (In class-based languages like Java, it'd be two classes instead of structs). So we start with two structs; One struct to hold our actual data - which is just a string in our example and another struct to wrap our first link in the chain (the "head"). Pseudocode:

struct Node
  data string
  next *Node

struct List
  head *Node

The List struct is what we'll create an instance of and attach our methods to:

method List Append(newNode *Node)
  if List.head == null
    // No head yet so create it
    List.head = newNode
  else
    // Already have a head; add new node to next link in chain
    currentNode = List.head  // start from head
    while (currentNode.next != null)
      currentNode = currentNode.next // loop 'til very end
    
    // Now currentNode is set to last node in list.
    // Assign to last node's "next" field to append the node
    currentNode.next = newNode

Next we'll look at what this looks like in Go and build on it.

Go Version

package main

import "fmt"

// A Node contains data and a link to the next node.
// The `next` field is same type as the struct, which is legal 
// because it's a pointer. Otherwise it'd give an error about
// "invalid recursive type Node".
type Node struct {
    data string
    next *Node
}

type List struct {
    head *Node
}

func (l *List) Append(newNode *Node) {
    if l.head == nil {
        l.head = newNode
        return
    }

    currentNode := l.head
    for currentNode.next != nil {
        currentNode = currentNode.next
    }
    currentNode.next = newNode
}

func main() {
    l := &List{}
    l.Append(&Node{data: "foo"})
    l.Append(&Node{data: "bar"})
    l.Append(&Node{data: "baz"})

    fmt.Printf("first=%+v\n", l.head)
    fmt.Printf("second=%+v\n", l.head.next)
    fmt.Printf("third=%+v\n\n", l.head.next.next)

    // Better yet, loop through the list
    // instead of manually chaining .next's
    for e := l.head; e != nil; e = e.next {
        fmt.Printf("e=%+v\n", e)
    }
}

Outputs:

    first=&{data:foo next:0xc42000a0a0}
    second=&{data:bar next:0xc42000a0c0}
    third=&{data:baz next:<nil>}

    e=&{data:foo next:0xc42000a0a0}
    e=&{data:bar next:0xc42000a0c0}
    e=&{data:baz next:<nil>}

Conveniently, the zero-value for pointers in Go is nil, which is exactly what we want `next` and `head` to be by default. So we can simply check if these fields were set or not yet, keeping our code more concise.

Ok, you've made a singly linked list! Pretty straight forward huh? (Pun intended).

Further Exploring

Make sure you really understand what's happening so far. Once you do you can easily build on that knowledge by learning more about linked lists and then about stacks, queues, trees, and other data structures.

Appending and other iterations through a list is in O(n) "linear" time which is slower than O(1) "constant" time of indexing. So there's a trade-off with the read performance of arrays but lists can be faster at inserts. Array values are stored one after the other in memory and nodes in a linked list can be all over the place in memory.

Of course, you don't actually have to name things as generally as List, Node, Append, etc. In real-world programs you'll probably have a more abstract representation such as Comments, Comment, AddComment, respectively.

A "circular" linked list has the last node pointing to the first node. There is also a "doubly" linked list, in which each node points to the next and the previous node. The names are pretty self-describing.

A singly linked list is fine when you only need to traverse in a forward direction (e.g. Social Media Posts). You can still sort and delete. You'll want to explore adding other common methods such as Prepend and Delete:

// Add a new head linked to the old head
func (l *List) Prepend(newNode *Node) {
    newHead := newNode
    newHead.next = l.head
    l.head = newHead
}

// Delete a node based on its data field's value
func (l *List) Delete(data string) {
    if l.head == nil {
        return
    }

    // If the head is matched then we need a new head
    if l.head.data == data {
        // Delete head by overwriting it with next node
        l.head = l.head.next
        return
    }

    currentNode := l.head
    for currentNode.next != nil {
        if currentNode.next.data == data {
            currentNode.next = currentNode.next.next
            return
        }

        currentNode = currentNode.next
    }
}

A "stack" is simply a list with Push and Pop methods. Typically in Go you'd just use a slice instead of creating a linked list since it's faster and easier. Then you'd use the slice's append function to push new data to the end of the slice and you'd attach your own Pop method to return/remove that last item from the list. Hence it's a LIFO structure (Last-in, First-out). Useful for operations such as undo/redo where your list is a history of commands. Note that you will also have to handle the case of popping on an empty list which will crash the program.

Python is a good example of stacks because it provides a built-in list type. Note that it's not implemented as a linked list but neither is a Slice in Go and both can still be used similarly. Python lists already come with .append() and .pop() methods:

    >>> stack = ["foo", "bar"]
    >>> stack.append("baz")
    >>> stack
    ['foo', 'bar', 'baz']
    >>> stack.pop()
    'baz'

You can add as many different kinds of methods to your lists or stacks as you need, just make sure you have the common ones that readers of your implementation will expect.

Similarly a "queue" is listed in FIFO order (First-in, First-Out). These are generally a little harder to optimize and again, you could just use a slice. Think job schedulers, paper printing, etc., where the first in line gets popped off first. You still just append items onto the list but the method is usually named Enqueue instead of Push. And the Pop method is named Dequeue. With slices you can just wrap append and you can dequeue using slice expressions.

Finally, it's probably worth mentioning that the Go stdlib comes with a generic doubly linked list implementation under container/list. I'd be wary of using it unless you really needed it since it takes and returns interface{}. It can still be nice for rapidly prototyping a non-slice implementation of a list, stack or queue:

import (
    "container/list"
    "fmt"
)

func main() {
    stack := list.New()
    stack.PushBack(1)
    stack.PushBack("foo")
    fmt.Println(stack.Remove(stack.Back()))  // foo
    fmt.Println(stack.Remove(stack.Back()))  // 1
    fmt.Println(stack.Len())                 // 0
}

That's all for now. This was meant to be a basic tutorial to get you started with learning more about these things before you dive deeper into other material on data structures. I hope it helped.

Sunday, October 8, 2017

Using Amazon Redshift with Looker

This post is an attempt to share some lessons I learned while migrating a Looker instance to Amazon Redshift.

It turns out that Looker isn't just great for Business Users to avoid having to write SQL, it's also great for DevOps to create Admin Dashboards into Redshift that everyone can use to avoid manually writing many complex queries against Redshift. I'll start by setting up Looker/Redshift connections, then Admin dashboards, then end with optimizing it all.

Redshift currently makes sense to use with Looker because it's an OLAP (Online Analytical Processing) database, as opposed to OLTP DBs like MySQL. If you plan to use other AWS services, such as Kinesis Firehose to store event data in Redshift, using Redshift as your data warehouse enables you to perform JOINs on that event data. Redshift already supports Window Functions (great for more real-time/ELT operations with Looker Derived Tables), CTE's (for more readable queries), etc. MySQL 8 will support many of these features but Redshift still has the advantage of being designed as a columnar database while offering additional analytical advantages.

The cloud-database space moves rapidly so keep an eye on Google, Oracle and others as they compete with Amazon. If you have a cloud-native app that uses another provider, that's fine since you can extract all that data in ETL/ELT but the main thing is to have a single data warehouse because Looker should only connect to one warehouse.

Migrating

Fortunately, you don't need to recreate your existing Looker account just to switch connections -- as long as the table and field names will (at least mostly) be the same between environments. To switch connections go to Admin>Database>Connections and create a new connection (leaving your old connection there for now):

The connection name has to be different. (I appended -redshift to the name)
For Dialect put "Amazon Redshift"
For Host:Port just put the host part of your Redshift endpoint, e.g., foo.us-east-1.redshift.amazon.com
For Database enter the name of the production Redshift DB to connect to
Enter the Schema you will connect to. (schemas are just like in Postgres)
You can probably leave the other options as defauls for now

At this point click `test connection`. If it says "Can connect", great. It should be pretty instant. If it takes a while and then fails with: Cannot connect: connection refused: Java::OrgPostgresqlUtil::PSQLException: Connection refused..." then it means Looker doesn't have access to your cluster yet.

If your production Redshift cluster is locked down, such that Looker.com can't connect to it yet, then whitelist Looker's IPs inside your Redshift VPC Security Group (VPC Security Groups work just like EC2 Security Groups). Looker required 5 IPs to be whitelisted, which worked but made it harder to tell what the IPs were for and we already had another IP listed for Kinesis that looked like Type=Redshift, Protocol=TCP, Port Range=5439 Source=<ip>/27. So instead of adding more IPs to the Inbound section, I created a new VPC Security Group called "Looker" and whitelisted those 5 IPs in there:

Redshift TCP 5439 <ip>/32
Redshift TCP 5439 <ip>/32
Redshift TCP 5439 <ip>/32
Redshift TCP 5439 <ip>/32
Redshift TCP 5439 <ip>/32

(Get the list of IPs to whitelist from here.)

I attached that new security group to the Redshift Cluster. This is key because you can't attach the new Looker security group to the Redshift security group (even though you can add it as a tag or type its group ID or name in since that would only act like a tag).

In the Develop menu dropdown, you should have a menu option called Content Validator (If you don't, ask Looker support to add access). Go into it and make sure your current Looker connection is validating. You will use this to find any errors when you switch your connection over to your new Redshift connection and update your LookML.

BTW, I highly recommend considering hosting your own Looker Git repo, so that you can view all of the diffs your team makes to the LookML.

IMPORTANT: before you can switch connections, you'll have to be in development mode and changing connections will BREAK PRODUCTION Looker for now. So don't attempt this until you can bring Looker down for maintenance!

To switch DBs, go to Develop>Manage LookML Projects. Click 'Edit Configuration' for the project you want to change connections for, change to the new connection, and click Save. It will probably say: model '<your model>' is not allowed to use connection '<your schema>' because you still need to update the model file(s) to use your new connection. Go to the model file from Develop>Project. Where it says connection: "<your connection>", change it to your new Redshift connection and click Save.

Next you'll need to update any table or field names in the View files to reference your new name. For example, if your old instances of sql_table_name were referencing a MySQL DB name, they will now need to reference your Redshift schema. Use search to find any other old references to change.

Now you should just need to port any SQL syntax to Redshift. For example, if your old connection was MySQL, look for errors such as The Amazon Redshift database encountered an error while running this query ... and convert it to Redshift syntax. One example I encountered was a query that used SELECT AVG(some_table.paid) which no longer works because Redshift doesn't want you to do an AVG on a bool (whereas the 'paid' field in MySQL was a TINYINT(1). So I cast it by going into its View file where a Measure defined the 'paid' field and changed: sql: ${paid} ;; to sql: ${paid}::int::double precision ;;. Remember: Redshift is not the exact same as Postgres where you could have just cast it to an int. Redshift's AVG() gives you an int result if the input is int. If you're used to using TINYINT(1) bools in MySQL, you can use a Dimension of type number: sql: case when ${TABLE}.paid then 1 else 0 end ;; rather than a yesno so that a sum Measure can just reference ${paid} and an average Measure can just use ${paid}::double precision. YMMV, but keep this in mind.

If you have lot of old LookML, keep going through and fixing all the syntax issues switching to Redshift caused. If your old connection was Postgres then there's not as much to change than if it were MySQL. For example, MySQL date functions such as now() will need to be changed to getdate(), period_diff() will be datediff(), date_format() will be to_char(), etc.

Redshift Admin Dashboards

Looker puts out really cool admin "blocks" that allow both admins and regular users to get insights into things such as ETL cycles, recent load failures, table architectures, and more; all within Looker without having to grant everyone access to the AWS Console.

The recommended way to create a Redshift Admin dashboard and Performance Overview in Looker is to follow these instructions but I will clarify some points that were missing for me. Start by creating a separate DB connection for your Redshift cluster to be used by this block. This step isn't absolutely necessary but it's a good idea since it increases performance because the Admin Block uses PDTs, so having a separate connection will reduce extra query load on your main connection. This connection will mostly be a duplicate, except you'll make a new user just for this connection since this user will have permissions to the metadata tables that the regular Looker user shouldn't. Everything else can be duplicated: the database and even the scratch schema.

I created a user called "looker" as my main connection's user. Don't use an admin user for Looker or it will have access to everything. For instructions on how to do this see this article. I found their explanation of granting SELECT on all tables to Looker inadequate so I did GRANT SELECT ON ALL TABLES IN SCHEMA <my_schema> TO looker;

If you haven't already, create a scratch schema that your PDT's, including the Redshift Admin Block, will use:

psql> CREATE SCHEMA looker_scratch AUTHORIZATION looker;
psql> ALTER USER looker SET search_path TO '$user',looker_scratch,<my_schema>,public;

In Admin>Connections select your connection and click the checkbox to enable PDTs. Under Temp Database enter 'looker_scratch' and click 'Update Connection'. Then click 'test connection'. If you get an error like: Failed to create/write to pdt connection registration table looker_scratch.connection_reg_3 then run:

psql> select has_table_privilege('looker', 'looker_scratch.connection_reg_r3', 'insert');

If that says 'f', then run:

psql> \dt looker_scratch.*

If that says the owner is someone else, what likely happened was that you first made a connection in Looker that had a user of someone else and that user created the connection_reg_r3 table first, so it's still the owner. One way to fix that is:

psql> DROP SCHEMA looker_scratch CASCADE;
psql> CREATE SCHEMA looker_scratch AUTHORIZATION looker;
psql> ALTER USER looker SET search_path TO '$user',looker_scratch,<my_schema>,public;

Now when you click 'test connection' with the 'looker' user, the looker user will be the owner of the look_scratch tables and should now pass with: Can use persistent derived tables in "looker_scratch"

We'll now create a new Redshift user called 'looker_admin' for this admin block to use and give that user access to the redshift tables and scratch_schema it will use:

psql> CREATE USER looker_admin WITH PASSWORD 'some_password';

psql> GRANT SELECT ON TABLE STV_WLM_SERVICE_CLASS_CONFIG TO looker_admin;
psql> GRANT SELECT ON TABLE SVV_TABLE_INFO TO looker_admin;
psql> GRANT SELECT ON TABLE STV_TBL_PERM TO looker_admin;
psql> GRANT SELECT ON TABLE STV_BLOCKLIST TO looker_admin;
psql> GRANT SELECT ON TABLE STL_LOAD_COMMITS TO looker_admin;
psql> GRANT SELECT ON TABLE STL_LOAD_ERRORS TO looker_admin;
psql> GRANT SELECT ON TABLE pg_class TO looker_admin;
psql> GRANT SELECT ON TABLE pg_namespace TO looker_admin;
psql> GRANT SELECT ON TABLE pg_database TO looker_admin;

psql> GRANT ALL ON SCHEMA looker_scratch TO looker_admin;
psql> GRANT ALL ON ALL TABLES IN SCHEMA looker_scratch TO looker_admin;

Looker's SQL Runner will run as the 'looker' user, not 'looker_admin'. So don't be confused if use it with queries like SELECT COUNT(*) FROM stl_load_commits, which will show a count of zero because these are virtual tables that only users with superuser privileges can see. Redshift's convention is to have rows generated by another user show up as invisible, rather than showing an error.

You're now ready to start making the Redshift Admin block work. This requires adding 3 files: A Model called redshift_model, a Dashboard called redshift_admin, and a View called redshift_views.

Make sure the model's 'connection' string points to the separate connection we made above for the looker_admin user. Also make sure to go into Develop>Manage LookML Projects and click 'Edit Configuration' next to the redshift_model and select the same connection for use with it. You should now see the admin dashboard when you go to the dashboard section of your user. It should show you sections like Table Load Summary, Recent Files Loaded, Recent Load Errors, and Database Consumption.

But wait, there are more useful dashboards to add: redshift_performance and redshift_query_inspection. Edit both dashboard's LookML if you need to change query_timezone from their default America/Los_Angeles to something else.

If you end up seeing duplicate dashboards for all your models, it's probably because your other model still does:

include: "*.view.lkml"
include: "*.dashboard.lookml"

which makes it include every dashboard, including your new redshift ones. There currently isn't a way to exclude redshift_ from the includes, so come up with a new naming convention like how the redshift_ LookML does. I changed the include in my main model to use a prefix "somePrefix_", include: "somePrefix_*.dashboard.lookml". This meant I needed to create any LookML dashboards to have a prefix now and for consistency I did the same for the Views. It also meant going into my main Model and changing all references to the views to use the prefix. It's really annoying but not as annoying to me as having duplicate dashboards.

After doing that, some of your existing Looks and Dashboards in the 'Browse' section will likely be broken. Go into Develop>Content Validator and click 'Replace' next to any errors and rename the explores to your new prefix. Once you commit, if you get any LookML errors fix them before you push. For example, I had a couple of Views referencing ${foo} where I needed to change it to ${somePrefix_foo} now. Don't blindly do that for every ${foo}, just errors, because most ${foo}'s are probably for Dimensions which don't need prefixes but sometimes a ${foo} is actually a reference to a View from another file. Once you're resolved these, click 'Validate Again' until the errors are gone.

Finally, you might want to adjust the time of day these Redshift Admin PDTs rebuild. Go into 'redshift_views' and change the value of sql_trigger_value.

Optimizing

Let me start by saying that I am not a Redshift expert and that the following are from my own understanding of the docs and my own experiments. Nonetheless, it may help your understanding because these are the topics that tend to trip up developers the most.

Whether or not you're using a star-schema design with Redshift, you can still leverage Looker and optimize your tables in similar ways. For example, we can see which sortkeys and distkeys a table has with:

$ psql [cluster credentials]
# set search_path to '$user', public, <my_schema>
# select "column", type, encoding, distkey, sortkey from pg_table_def where tablename = 'foo';

However, it's useful to first know which queries are actually slow. To do this, log into the AWS Console and go to your Redshift cluster and then the Queries tab. When you find a slow query, try running it directly, e.g. in Looker's SQL Runner or via the psql command. However, keep in mind that first runs of your query will be slower because Redshift compiles and then caches them. Therefore, you'll want to do your benchmarking against the warm cache versions.

While you're still in the Queries tab, click on the actual query and then under Query Execution Details you'll see the "Explain plan". Look for the tab called Actual. Now you can click on the plan's steps to drill down further and see if there are any warnings or how many rows are being scanned - which is indicative of your query reading too many rows from disk, so you may need to change your sortkeys or run a VACUUM. You may also need to adjust your query to filter better...

Redshift is a columnar database, which means that each block holds data for only a single column. The "zone map" is held separately from the block, like an index. The zone map holds only two data points per block, the highest and lowest values in the block. Redshift consults the zone map when executing queries, and excludes the blocks that the zone map indicates won’t be returned by the WHERE clause filter.

You might notice that your ad-hoc queries run perfectly fast in warm cache, but perhaps your Look or Dashboard in Looker is much slower. This is probably because Looker's visualizations can be slow to render in the browser. Fortunately, once you (or anyone else) loads something in Looker, it will remain cached until your caching-policy resets the cache, e.g. once per day after ETL run.

Aside from incorrect queries, the general place to look into why queries are slow in redshift is usually with a distkey/diststyle and second most would be sortkeys. The distyle of EVEN is the default and means "distributed across all the nodes". The default is generally fine because you don't have bottlenecks or 90% disk space and CPU on one node while other nodes are empty. It makes all nodes participate to the best of their ability. However, it can take longer to do joins.

Unfortunately, it's not easy (or even realistic) to know ahead of time which distkeys and sortkeys to add because you have no way of knowing which queries people will plan to write against your tables a year from now. And since you have to add distkeys and sortkeys at table creation time, there are two main things you'll need to do: 1) Make it as easy as possible to change your schema, so you can edit the keys and repopulate the data 2) Use EXPLAIN on the queries you do have, to figure out which columns could benefit from different keys.

When you run JOINs, think in terms of distributed joins. Say you have a query like:

EXPLAIN SELECT groups.name COUNT(DISTINCT u_id)
FROM groups JOIN users ON groups.g_id = users.g_id
GROUP BY 1;

and EXPLAIN output shows Hash Full Join DS_DIST_BOTH. It means both outer and inner, which redistributes part of the data, allocating different ranges to different nodes. This is usually the default in Redshift if you don't optimize, and it's one of the least optimized! So you will want to add distkeys and sort keys. Also, BCAST (broadcast) means "duplicate the entire table to all the nodes" but isn't seen as often yet some queries need broadcast. Most use DIST (distributed) which is generally better to use diststyle: ALL to broadcast once, rather than every query run.

Why is this so important? Redshift is not designed to run on a single node. It's designed to run in a sharded cluster and it's expected to have very bad numbers within only one node. This is one reason you'll get slow queries. In MPP databases, data is partitioned across multiple servers & nodes with communication across a network rather than disk. The nice thing is that your SQL queries will look as though you're selecting from a single database, this is thanks to using a "distribution key".

Say you commonly JOIN or GROUP BY using the 'state' column of your table. You may choose to distribute on that column. However, if you set the DISTKEY to the state column the distribution won't be as even. In a large table, this amount of distribution "skew" could have an adverse impact on query processing. From the columns used in your queries, choose a column that causes the least amount of skew as the DISTKEY. A column with many distinct values, like a timestamp, would be a good first choice. Avoid columns with few distinct values such as credit card types, error messages or days of the week.

While the general rule of thumb is to add distkeys to the fields you'll be JOIN'ing on the most and to add sortkeys to fields you'll be filtering the most in WHERE clauses, you need to benchmark before/after and really look into the EXPLAIN output to be sure you're not making things worse. There is more to consider than just query speed, such as CPU usage and other things that can cause problems or drive up your AWS bill.

Pick a few important queries you want to optimize your databases for. You can’t optimize your table for all queries, unfortunately. Even though it will rarely be the best performance, a table with no DISTKEY/SORTKEY can be an ok all-round performer. It’s a good option not to define DISTKEY and SORTKEY until you really understand the nature of your data and queries.

That all said, sortkeys are less likely to backfire. If you don't use sortkeys, every query you'd do has to read all the files from disk (like a full table scan) before it can decide if it's something it can use. This sortkeys are comparable to indexes. If you use sortkeys you can more easily reason about and measure the min/max values of each of the 1mb blocks, so you can skip blocks. So when you WHERE to filter by some value, Redshift wouldn't have to read everything to do it. Take:

SELECT COUNT(1) FROM users WHERE state = 'California'

Most of the time of this query is spent in 'Seq Scan on users'.

The Redshift table STL_EXPLAIN gives you actual EXPLAIN output of all your queries. This is very useful because you can audit this list rather than figuring out all the queries and manually put EXPLAINs in front of them. It shows the time cost of query plans. Don't pay too much attention to the number, but you can use it to reason about what is probably the most time-consuming task. The number at the top contains the numbers at the bottom:

Plannode           | cost   | info
Aggregate          | 68718  | 
  Seq Scan on users | 62500  | Filters: state = 'California'

See also the STL_SCAN table. Helps you figure out which columns will contribute the most to optimizing. It helps you see what are the top columns that take a lot of time being scanned and have the greatest potential for improvement. Fortunately, we created some Admin Dashboards above in this post, so that you can peer into all of this more easily.

Sunday, January 3, 2016

Data Pipeline and ETL tasks in Go using Ratchet

As Data Engineers and Software Engineers we might define Data Pipelines differently. This Wikipedia entry defines a pipeline as a set of data processing elements connected in series, where the output of one element is the input of the next one; often executed in parallel. That is pretty close to how I'll define it here. More specifically, I'll use The Golang Blog's informal definition:

A series of stages connected by channels, where each stage is a group of goroutines running the same function. In each stage, the goroutines receive values from upstream via inbound channels; perform some function on that data, usually producing new values; and send values downstream via outbound channels.

Ratchet is a Go library that abstracts these pipeline concepts, giving you a clear-cut framework to work with. You will satisfy a data processing interface and be provided with some convenience methods for managing the flow of data.

Perhaps you have been tasked with creating automated pipelines that fetch data from disparate sources -- such as remote databases or web services -- and then cleaning up, transforming and aggregating the unorganized data into specialized databases or data sources (an RDBMS, CSV files, email queues, etc.) so your team can use that organized data for analytics and other purposes since most programs are a form of ETL.

Ratchet is an open-source project by Daily Burn. It provides a better way of extracting data from the databases and web services an application uses, then transforming and loading that data into reporting-oriented formats. It makes doing these tasks more robust than creating basic scripts or using limited 3rd-party services that don't give you full control over your data processing.

SETUP

You can find code examples here.

Install ratchet by running:

go get github.com/dailyburn/ratchet

Example project layout (There's no required convention):

$GOPATH/src/<YOUR REPO>/<YOUR USER>/<YOUR PROJECT>
├── main.go              Data Processor initialization & Pipeline
├── packages             Your non-main packages, e.g. mypkg
│   ├── models.go        JSON structs for transforms.go
│   ├── queries.go       SQL query functions for SQLReaders
│   └── transforms.go    Custom DataProcessors

Ratchet consists of a Pipeline with a series of PipelineStages, which each handle one or more DataProcessors. DataProcessors each run in their own goroutine so all of the data is processed concurrently. The DataProcessors send and receive JSON for convenience. Ratchet gives you some useful data processors and an interface for you to implement custom ones.

Your Ratchet tasks will likely use a combination of the provided DataProcessors and your own custom ones. If you're creating an ETL task, the custom DataProcessors will usually be for the Transform stage while the Extract and Load stages could use the built-in Reader/Writer DataProcessors such as SQLReader and SQLWriter. But there are times when you will create custom Data Processors even for the extract stages, such as when you're making calls to 3rd party REST Apis, and the like.

There are other DataProcessors provided for your convenience, such as ones for Google's BigQuery and more. You will get to mix-and-match things as they make sense to your application. See the full-list of provided DataProcessors.

You will typically begin your Ratchet code by including the necessary packages. For the sake of example I will assume you are also going to access an RDBMS, such as MySQL, to perform some SQL reads and/or writes on:

package main

import (
    "github.com/dailyburn/ratchet"
    "github.com/dailyburn/ratchet/logger"
    "github.com/dailyburn/ratchet/processors"
    "github.com/<YOUR_USER>/<YOUR_PROJECT>/packages"
    "database/sql"
    _ "github.com/go-sql-driver/mysql"
)

Note that for PostgreSQL you'd import _ "github.com/lib/pq".

For the examples below we will be assuming we're extracting data from a MySQL database `srcDB`, querying its `users` table, transforming the data, and finally outputing the transformed data to a table `users2` in the destination database `dstDB`.

EXTRACT

Since an SQLReader takes a *sql.DB as the first parameter, you'll start by creating one, as we do with the setupDB function below. You'll pass that and an SQL query string to NewSQLReader to create an SQLReader for the "extract" stage:

func NewSQLReader(dbConn *sql.DB, sql string) *SQLReader

For example, in queries.go put functions that return your SQL strings:

package mypkg

func Query(minId int) string {
    return fmt.Sprintf(`SELECT id, name FROM users 
        WHERE id >= %v`, minId)
}

We simply created a function that takes arguments, generates some SQL and return a string. We can call it from main.go like:

func main() {
    inputDB := setupDB("mysql", "root:@tcp(127.0.0.1:3306)/srcDB")
    extractDP := processors.NewSQLReader(inputDB, mypkg.Query(5))
}

func setupDB(driver, conn string) *sql.DB {
    db, err := sql.Open(driver, conn)
    if err != nil {
        logger.Error(err)
        panic(err)
    }
    return db
}

Here extractDP (DP meaning DataProcessor) is of type *processors.SQLReader.

If you only need a static SQL string then NewSQLReader does the job. For a query to use the JSON results from a previous stage, use NewDynamicSQLReader.

TRANSFORM

When you need to create your own data processors, such as when creating transformers, you implicitly satisfy the DataProcessor interface by implementing its ProcessData and Finish methods. You will create a struct and attach these methods to it. It's up to you to determine how your structure will hold any state you need. Also, because ProcessData deals with receiving JSON from the previous stage and then passing JSON on to the next stage, you will need to create structs for Unmarshaling and Marshaling the JSON.

The data package includes some types and functions for using JSON with Ratchet. It provides wrapper functions for json.Unmarshal and json.Marshal called data.ParseJSON and data.NewJSON that provide additional logging. It also creates the type data.JSON as a simple name to a byte slice: type JSON []byte.

For example, put the following in transforms.go:

package mypkg

import (
    "github.com/dailyburn/ratchet/data"
    "github.com/dailyburn/ratchet/util"
)

type myTransformer struct{}

// Expose our DataProcessor for clients to use
func NewMyTransformer() *myTransformer {
    return &myTransformer{}
}

// I like to move these structs to a file called models.go
type ReceivedData struct {
    ID int `json:"id,omitempty"`
}
type TransformedData struct {
    UserID         int    `json:"user_id,omitempty"`
    SomeNewField   string `json:"some_new_field"`
}

func (t *myTransformer) ProcessData(d data.JSON, 
                                    outputChan chan data.JSON,
                                    killChan chan error) {

    // Step 1: Unmarshal json into slice of ReceivedData structs
    var users []ReceivedData
    var transforms []TransformedData 
    
    err := data.ParseJSON(d, &users)
    util.KillPipelineIfErr(err, killChan)

    // Step 2: Loop through slice and transform data
    for _, user := range users {
        transform := TransformedData{}
        transform.UserID = user.ID;
        transform.SomeNewField = "whatever"
        transforms = append(transforms, transform)
    }

    // Step 3: Marshal transformed data and send to next stage
    if len(transforms) > 0 {    
        dd, err := data.NewJSON(transforms)
        util.KillPipelineIfErr(err, killChan)
        outputChan <- dd
    }
}

func (t *myTransformer) Finish(outputChan chan data.JSON,
                               killChan chan error) {}

Notice the idiomatic NewMyTransformer function that returns a pointer to our DataProcessor's struct with its zero value. Since it starts with a capital letter, it will be used by package users. For example, in main.go:

transformDP := mypkg.NewMyTransformer()

Our transformer changed the name `id` to `user_id` for the destination table. It also omitted the `name` field and used a new field `some_new_field` that wasn't present in the source table.

LOAD

To load the data you've extracted and optionally transformed, you'll create NewSQLWriter processor:

func NewSQLWriter(db *sql.DB, tableName string) *SQLWriter

Example:

outputDB := setupDB("mysql", "root@tcp(127.0.0.1:3306)/dstDB")
outputTable := "users2"
loadDP := processors.NewSQLWriter(outputDB, outputTable)

RUNNING THE PIPELINE

Once you have your data processors setup, you will just need to pass them into a new pipeline for processing. If you just have one reader, one transformer and one loader, you can use a basic 3 stage pipeline using NewPipeline:

pipeline := ratchet.NewPipeline(extractDP, transformDP, loadDP)
pipeline.Name = "My Pipeline"
 
err := <-pipeline.Run()
if err != nil {
    logger.ErrorWithoutTrace(pipeline.Name, ":", err)
    logger.ErrorWithoutTrace(pipeline.Stats())
} else {
    logger.Info(pipeline.Name, ": Completed successfully.")
}

If you do not need a transform stage you can have just Extract and Load stages: ratchet.NewPipeline(extractDP, loadDP). This is enough if your extracted SQL fields match and table you will load the data into.

Things aren't always that simple so Ratchet provides a more flexible branching pipeline, via NewPipelineLayout, that takes a variable amount of NewPipelineStage's which in turn take a variable amount of DataProcessors. Inside the pipeline stages you will wrap your data processor instances with the Do function and pass its returned value to the Outputs method, following these rules:

DataProcessors in a non-final PipelineStage must use Outputs.
A DataProcessor must be pointed to by one of the previous Outputs (except in the first stage).
Outputs must point to a DataProcessor in the next immediate stage.
DataProcessors in the final stage must not use Outputs.

Here's how the basic 3 stage pipeline shown above would look as a branching pipeline:

layout, err := ratchet.NewPipelineLayout(
    ratchet.NewPipelineStage( // Stage 1
        ratchet.Do(extractDP).Outputs(transformDP),
    ),
    ratchet.NewPipelineStage( // Stage 2
        ratchet.Do(transformDP).Outputs(loadDP),
    ),
    ratchet.NewPipelineStage( // Stage 3
        ratchet.Do(loadDP),
    ),
)

pipeline := ratchet.NewBranchingPipeline(layout)
err = <-pipeline.Run()

Fortunately, you can do a lot more than that with a branching pipeline. Outputs can take multiple parameters to shove data into and you will often have multiple calls to Do in different stages, for handling disparate data. Sometimes you'll have 4 or 5 stages, output to multiple transformers, batchers, etc.

Example of how subscriptions & refunds layout stages might look in a real-world ETL job:

ratchet.NewPipelineStage(
    ratchet.Do(users).Outputs(subs, refunds),
),
ratchet.NewPipelineStage(
    ratchet.Do(subs).Outputs(batch),
    ratchet.Do(refunds).Outputs(batch),
),
ratchet.NewPipelineStage(
    ratchet.Do(batch).Outputs(subsTransformer, refundsTransformer),
),
ratchet.NewPipelineStage(
    ratchet.Do(subsTransformer).Outputs(writeSubs),
    ratchet.Do(refundsTransformer).Outputs(writeRefunds),
),
ratchet.NewPipelineStage(
    ratchet.Do(writeSubs),
    ratchet.Do(writeRefunds),
),

As you can see, Ratchet is very flexible. Here's another example layout - this time using a custom DataProcessor to get JSON results from a RESTful Shopify API and pass data to two different transformers. This works because your first stage doesn't have to be an SQLReader, all that matters is that it sends and receives JSON:

// some custom DataProcessor you make up to query Shopify
shopifyStage := shopify.NewShopifyOrders(shopifyAPI)
...
layout, err := ratchet.NewPipelineLayout(
    ratchet.NewPipelineStage(
        ratchet.Do(shopifyStage).Outputs(orderTransformer),
        ratchet.Do(shopifyStage).Outputs(orderItemTransformer),
    ),
    ratchet.NewPipelineStage(
        ratchet.Do(orderTransformer).Outputs(writeOrders),
        ratchet.Do(orderItemTransformer).Outputs(writeOrderItems),
    ),
    ratchet.NewPipelineStage(
        ratchet.Do(writeOrders),
        ratchet.Do(writeOrderItems),
    ),
)

Sometimes you will want to pass the original object through the various stages in case you need to use it in a later stage. This is very easy to do using NewPassthrough DataProcessor. In the following example, passing passthrough to Outputs makes it so aDP gets passed to the next stage along with the other values passed to Outputs:

passthrough := processors.NewPassthrough()

...

ratchet.NewPipelineStage(
    ratchet.Do(aDP).Outputs(anotherDP, passthrough),
),
ratchet.NewPipelineStage(
    ratchet.Do(anotherDP).Outputs(yetAnotherDP),
    ratchet.Do(passthrough).Outputs(yetAnotherDP),
...

BATCHING

Finish is called by Ratchet after a previous stage is done sending its data. You can often implement this as an empty method. Finish is more useful when you want to wait until all the data has been received before doing something with it, and if you're working with more than one input source you may really want this. If that's the case, you will typically use ProcessData for validating and storing the incoming data into the receiver struct and then doing the second and third steps above inside of Finish instead.

Here is an example of how we could rewrite the above to batch things using a non-empty struct and a pointer receiver, then complete the transformation stage in the Finish:

type myTransformer struct{
    BatchedUsers []User
}

func NewMyTransformer() *myTransformer {
    return &myTransformer{}
}

func (t *myTransformer) ProcessData(d data.JSON, 
                                    outputChan chan data.JSON,
                                    killChan chan error) {

    // Step 1: Unmarshal the JSON into a User slice
    var users []User
    
    err := data.ParseJSON(d, &users)
    util.KillPipelineIfErr(err, killChan)

    // Step 2: append via pointer receiver
    t.BatchedUsers = append(t.BatchedUsers, users...)
}

func (t *myTransformer) Finish(outputChan chan data.JSON,
                               killChan chan error) {

    var transforms []TransformedData

    // Step 3: Loop through slice and transform data
    for _, user := range t.BatchedUsers {
        transform := TransformedData{}
        transform.UserID = user.ID;
        transform.SomeNewField = "whatever"
        transforms = append(transforms, transform)
    }

    // Step 4: Marshal transformed data and send to next stage
    // Write the results if more than one row/record.
    // You can change the batch size by setting loadDP.BatchSize
    if len(transforms) > 0 {
        dd, err := data.NewJSON(transforms)
        util.KillPipelineIfErr(err, killChan)
        outputChan <- dd
    }
}

DEBUGGING

Ratchet provides a logging facility that's very useful for debugging Ratchet tasks. It is often helpful to temporarily place the following line into your ProcessData implementation so that you can see output from the calls to logger.Debug:

logger.LogLevel = logger.LevelDebug

Even better is for you to create a way to set the different levels from the CLI. For example: --log-level="debug". And so on for LevelError, LevelInfo, LevelStatus, LevelSilent. I recommend using LevelSilent on production or whenever you just need the job to run faster.

Add pipeline.PrintData = true along with LevelDebug to set the pipeline to show all data payloads received when logger is in Debug mode. This is a very convenient way to see all of the data flowing through the different stages.

Be sure to read the log output (ratchet_default.log) because it shows the SQL INSERT Data that was actually executed, for example. The logger package provides some other useful logging functions for you to use as well.

Another debugging/development tip is to temporarily write the final stage output to a CSVWriter that goes to standard output. This allows you to quickly prototype a Ratchet task without having to set up or write to your final database table(s) yet. Example:

// Setup all the DataProcessors
users := processors.NewSQLReader(inputDB, mypkg.Query())
bq := processors.NewDynamicBigQueryReader(bqConfig, mypkg.Query)
bq.UnflattenResults = true
transformer := mypkg.NewMyTransformer()
writeCSV := processors.NewCSVWriter(os.Stdout)

// Create a new Pipeline using the DataProcessors
layout, err := ratchet.NewPipelineLayout(
    ratchet.NewPipelineStage(
        ratchet.Do(users).Outputs(bq),
    ),
    ratchet.NewPipelineStage(
        ratchet.Do(bq).Outputs(transformer),
    ),
    ratchet.NewPipelineStage(
        ratchet.Do(transformer).Outputs(writeCSV),
    ),
    ratchet.NewPipelineStage(
        ratchet.Do(writeCSV),
    ),
)

TIPS

Make sure to add the String method to your transformer because the log output by default is to show all the data in the struct, which shows too much output and slows everything down:

func (t *myTransformer) String() string {
    return "myTransformer"
}

Tip: Creating structs for all the JSON you will be processing can be tedious so I highly recommend the tool json2go.

I have covered all of the main concepts. Please see the Ratchet documentation. Some things are currently undocumented, such as a 3.0 branch that adds support for Amazon Redshift so see the source code and GitHub repository for more information. If you have any questions please post them in the comment section below.

Sunday, May 31, 2015

Full-Site HTTPS Lessons

There may be more to consider than you think before attempting to convert your web site to HTTPS. Even if you already use HTTPS, you might benefit from this information if you aren't sure you set it up properly. I will cover both the obvious and the not-so-obvious things you need to know so you can avoid common pitfalls and plan out your conversion more completely.

Every web site these days should support https:// on every page in order to be more secure and to provide privacy to its users. It can be be very easy or fairly difficult to convert an existing site to HTTPS. If it happens to be difficult for your site, don't be discouraged. It will benefit everybody because your site will be more secure and enjoy a better reputation.

Google has done a great job of explaining why you should use HTTPS on all your pages (HTTPS Everywhere). Man-in-the-Middle aren't just easy if you aren't using HTTPS at all, but also if you're not using it everywhere or properly. For example, encrypting passwords isn't enough if your cookies aren't secure, and your cookies can't be secure if you're not using https everywhere. We don't want our faithful users to have their sessions hijacked while they're browsing our site at a coffee shop.

What I would like to talk about are the lessons that I learned from setting this up at about.me, which was an existing site with millions of users at the time. I'll start by saying that you should not procrastinate on doing this. It is much easier to set up HTTPS everywhere if you start early but it is possible to do it on any website -- it will just take more work.

I recommend reading this entire article once through first, before trying any of it out. Use this information at your own risk since it is just from my experience and your mileage may vary.

The first thing you'll need is an SSL certificate. Find out if you will need a wildcard cert and so forth. Your site without a subdomain is called a "naked domain" (e.g. example.com). You may need to ensure your certificates include subdomains to avoid site visitors getting certificate errors if they go to the www. address for your site - or maybe you handle this by redirecting www to the naked domain in your web server config. And while not necessary, you should also decide if you want to use Extended Validation (EV) certificates. Anyway, do your homework before purchasing your certificate.

Once you have your certificates figured out, you start by making every page redirect from http to https. This is usually as simple as adding the appropriate redirect directives to wherever you terminate SSL (NGINX, Apache, etc). In NGINX this is usually as simple as adding return 301 https://$domain$request_uri; to your HTTP server block; then in your HTTPS server block you'll add directives to enable ssl, listen on port 443, link to your certificate and key and add any needed headers. Refer to your web server's documentation.

It is a myth that having https on every page is going to be too slow. So please do not try to only add https to a subset of pages, such as just the home page or login page. Doing so will only add complexity to your configurations without making your site secure enough.

In your web server config, you will also specify the protocols to support. You should disable support for SSL v2 and v3. With NGINX, just specify the protocols you want, and leave out what you don't want, otherwise it defaults to SSSv3. So do something like: ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

It may seem confusing that we're using the terms SSL and TLS interchangeably. Just note that you'll want to configure support for TLS encryption, not SSL. See also There IS NO SSL.

We're about half-way there. Unfortunately, simply having every web page be HTTPS isn't enough. You also need to have all your important resources that load on those pages (images and scripts) be https or you aren't really secure yet. Plus if you don't do this, you will see a lot of Mixed-Content warnings in the browser in the console and a warning on the lock icon. So having https:// in the address bar isn't enough. Don't try to cheat because you will not fool attackers nor your users.

You should start by first configuring this all on your dev machine (or at least a staging machine), rather than production. This is because you will likely encounter errors that can take a while to resolve and you don't want to leave your site in a broken state. For example, if it turns out that your site has hundreds of hard-coded http:// URL strings in the database then you'll need first run a migration script to convert these to "https://". (Yes https://, because protocol-relative // urls are now an anti-pattern.)

If your site has hundreds of thousands of users, then running such a migration script could take hours or even days. So you will want to run a migration script first if needed. Once your migration finishes, you will also need to refactor your code to start saving your URLs as https so you don't run into the same problem again. Depending on when you deploy, you may need to run your migration script once more to catch up.

DO NOT skip this step: To prevent session hijacking, you'll want to enable the Secure and HttpOnly attributes on your session cookies. These are security flags that you add to your cookies by appending 'Secure; HttpOnly' to the cookie HTTP response header. The Secure flag makes the cookie only accessible over HTTPS so users can't unknowingly "leak" their cookies by typing http:// to get to your site before it redirects to https://. The HttpOnly flag makes cookies invisible to client-side scripts so cookies can't be stolen by XSS exploits.

To test that your cookies have the Secure and HttpOnly flags, simply check the Set-cookie value in the response from curl'ing your website. In a browser such as Google Chrome, you can also go into the developer tools and inspect your cookies and you should see checkmarks under the Secure and HttpOnly columns.

Your development sandboxes should use https:// since it's best that development machines closely match production. This will help prevent developers from hard-coding http:// url's but not realize it until there are https warnings on production. This means that your developer machines will need to be running NGINX or whatever web server you use in production. Creating a self-signed certificate will make this easy enough. There are additional benefits to this, such as using a hostname like https://yoursite.dev instead of http://localhost:3000.

Lastly, make sure that any middleware and front-end code that does redirects, does so as https. You can test this on the command-line by running:

 $ curl -I -L https://yoursite/whatever

and checking if any Location: lines show http:// URLs. Assuming that /whatever is a controller that does redirects, you want to be sure that it doesn't do any redirecting back to http:// at any request. I've had to ensure https:// redirects at the WSGI-middleware and at the front-end controller code layers.

To further test that you set everything up correctly, run your domain through a scanner such as https://www.ssllabs.com/ssltest/. It will give you different grades and your goal is to get an 'A'. For a command-line tool, check your package installer for sslscan (Homebrew, apt-get, etc). Additionally you can check that your site uses strong certificates by using https://shaaaaaaaaaaaaa.com/. If you get bad scores, make sure you're using SHA-2 certificates and 256-bit TLS ciphers.

Further considerations are making sure you disable support for weak ciphers, using at least SHA-2 (even on your intermediate certs) and looking into SSLStrip and HSTS headers. Also, look into Perfect Forward Secrecy (PFS) so attackers can't use private key to decrypt archived sessions. Finally, make sure your supported browsers support everything you have set up.

You should also start linking to the https:// version of your site -- in emails and so on -- so your servers don't incur any unnecessary HTTP redirects and because you should be proud that you can provide https links to your web site.

So please get started on this as soon as possible. Don't put off the inevitable.

UPDATE: Further reading: Let's Encrypt and QUIC.

Saturday, April 26, 2014

Using Browserify with jQuery, Backbone.js and more.

This article aims to help you understand how to use Browserify in your projects to manage your front-end dependencies, as well as your back-end ones. It assumes you know what Browserify does and that you are confused about exactly how to use in a real project to handle all of your dependencies, particularly with client-side frameworks. I will show examples with jQuery and Backbone.js, but what I explain will apply somewhat to other frameworks and libraries too.

Using Browserify with jQuery

A good first step toward understanding how to use browserify is with jQuery. Delete any script tags that include jquery.js in your html on the client-side. Instead, install the jquery node module (into ./node_modules) and save a reference to it in package.json:

 $ npm install jquery --save

jQuery 2.1.0 is capable of using browserify's module.exports. That is, you can install the npm package called "jquery" instead of the older workarounds such as the "jquery-browserify" npm package.

jQuery will need a window object:

 var $ = require('jquery');

Older jQuery's will need to do this as var $ = require('jquery')(window);or var $ = require('jquery/dist/jquery')(window);

If you omit the relative path it'll resolve to the module in node_modules, allowing us to just require('jquery') rather than specify a path to jquery.

Whenever any of your JavaScript files need a reference to $ or jQuery you simply create them as needed, with the above jquery require line or with var jQuery = require('jquery'); if you use the name jQuery instead of $.

Now you browserify your code:

 $ npm install -g browserify
 $ browserify whatever.js -o bundle.js

You'll then use a script tag to include bundle.js instead of jquery.js. That's it!

Using Browserify with Backbone.js

Install Backbone.js, Underscore.js and jQuery as node modules:

 $ npm install backbone underscore jquery --save

We explicitly npm installed underscore, instead of letting backbone install it, so that we can require('underscore') later when needed. And we used the --save flag to store these versions in package.json because it's a good habit to get into.

If you structured your Backbone code using a single global object, like app, then you'll need to replace all the lines (if any) that say:

 var app = app || {};

with something like:

 var $ = require('jquery');
 var Backbone = require('backbone');
 var _ = require('underscore');
 Backbone.$ = $;

in all of your JavaScript files. Each file may only need certain objects, and so we should only put the ones each module needs. Being explicit about which dependencies are needed is part of what good modularity is all about. For example, if you only need a reference to Backbone in one file then only do:

 var Backbone = require('backbone');

Backbone can run fine without jQuery, but you can explicitly tell Backbone to use jQuery when needed by doing Backbone.$ = $;

If you explicitly need the '_' object because you're invoking underscore directly, then also include the var _ = require('underscore');

Because of the way Browserify encapsulates every module into its own scope, we no longer have to pass things like jQuery into an IIFE to avoid global conflicts, and we can even put 'use strict' at the TOP of our modules. So instead of:

 var jQuery = require('jquery');
 var Backbone = require('backbone');
 Backbone.$ = jQuery;

 (function($) {
     'use strict';

      module.exports = Backbone.View.extend({
        ...
 )(jQuery);

we can simply write:

 'use strict';

 var $ = require('jquery');
 var Backbone = require('backbone');
 Backbone.$ = $;

 module.exports = Backbone.View.extend({ ... });

Notice how the above code also needed var Backbone = require('backbone'); because it references Backbone in this file. Everything not in module.exports will stay private to its module. In Node.JS, module.exports is simply the object that gets returned as the result of a require call.

Next delete any script tags that include your user-defined Backbone javascript files and replace with just:

 <script src="js/bundle.js"></script>

Generate the bundle.js file

 $ browserify ./public/js/app.js -o public/js/bundle.js

If we were using the global app object instead of browserify, we'd have to make sure js/views/app.js gets included before js/app.js in the script tag include order. But with browserify we take care of the ordering differently. For example, in public/js/app.js put:

 var $ = require('jquery');
 AppView = require('./views/app');
 // The only thing that should be in a DOMReady
 $(function() {
     new AppView();
 });

Then in public/js/views/app.js put:

 'use strict';

 var Backbone = require('backbone');
 var $ = require('jquery');
 Backbone.$ = $;

 // examples of requiring some user-defined Backbone stuff
 var Todos = require('../collections/todos');
 var TodoView = require('./todo');

 //var app = app || {};

 //app.AppView = Backbone.View.extend({
 module.exports = Backbone.View.extend({

     initialize: function() {
         //this.listenTo(app.Todos, 'add', this.addTodoTask);
         this.listenTo(Todos, 'add', this.addTodoTask)
         ...

Notice how I commented out the old app global object stuff and replaced it with Browserify'd code. You basically just need to replace app.whatever with module.exports = whatever. And for every file that used to reference the global app object, you now need to require() that file and store it in a local object. So continuing the example, above '../collections/todo.js' might contain:

 'use strict';

 var Backbone = require('backbone');
 var Todo = require('../models/todo');
 //var app = app || {};

 var TodoList = Backbone.Collection.extend({
     //model: app.Todo,
     model: Todo,

 });

 module.exports = new TodoList();

Again I commented out the global app object to show that's where browserify's stuff goes.

Since a Backbone.js project can become large, or any project for that matter, it's a good idea to install watchify. It takes the same command-line arguments as browserify but will automatically update the bundle.js file whenever you make changes, and it will update it faster since it only updates the changes.

Additionally, using the power of Browserify transforms, you can integrate Bower packages and more. This is useful if you need a dependency that uses CSS or something that Browserify isn't yet suited for. Perhaps I'll write more about this next time.

Monday, March 17, 2014

Headless Dropbox

You can install Dropbox on a Linux server or any Linux system that doesn't have an external monitor or a GUI Windowing System.

I recently installed Dropbox on my Digital Ocean Droplet, running Ubuntu 12.10 server. I did not figure this all out on my own. I read the sources mentioned at the bottom of this post. However, some of the information was incomplete or contained broken links, and I ran into other issues. So this blog post is more of about how I got it working in hopes that you won't encounter the same issues I did.

First you will need to link your Linux machine with your Dropbox account by downloading and running a Dropbox server script called dropboxd:

 $ cd  # changes to your $HOME directory
 $ wget -O - "https://www.dropbox.com/download?plat=lnx.x86" | tar xzf -
 $ ~/.dropbox-dist/dropboxd

 Use the above command if you're running "uname -m" says i686.
        
 $ cd
 $ wget -O - "https://www.dropbox.com/download?plat=lnx.x86_64" | tar xzf -
 $ ~/.dropbox-dist/dropboxd

 Use the above command if "uname -m" says x86_64.

The running dropboxd command should output a link for you to paste into a browser to link that system. I had issues with this at first and just had to keep rerunning the command over time and leaving it running until it finally started saying something. This was probably due to connection issues that you hopefully won't have. You can then verify that it's linked by logging into your dropbox web account and going under the "Security" tab to see linked devices.

The link uses a token so you don't have to store your Dropbox password on the Linux instance.

Once Dropbox is linked you should see your ~/Dropbox directory starting to sync. At this point, you can temporarily stop the syncing by hitting ^C (Control-C) to stop dropboxd. Don't worry we'll finish syncing in a minute.

Let's download a nice python script to help us manage Dropbox better:

 $ wget -O ~/bin/dropbox.py "http://www.dropbox.com/download?dl=packages/dropbox.py"
 $ chmod 755 ~/bin/dropbox.py

and download a bash script that we'll use to start and stop Dropbox:

 $ wget -O ~/bin/dropbox_temp "https://gist.githubusercontent.com/benhedrington/\
 2347727/raw/108fc8af551cb4fdf7cdd08b891a45f405d283dc/dropbox"
 $ chmod u+x ~/bin/dropbox_temp

Edit ~/bin/dropbox_temp in Vim (or whatever editor you like) and change DROPBOX_USERS="user1 user2" to whatever user(s) on your system you want to run dropbox clients for.

Now move dropbox_temp script:

 $ sudo mv ~/bin/dropbox_temp /etc/init.d/dropbox

and have it start when we boot:

 $ sudo update-rc.d dropbox defaults

Close the dropboxd script if you still have it running manually: ^C.

Check if dropbox is running, if it's not then start it:

 $ sudo service dropbox status
 $ sudo service dropbox start

If it's is running then run:

 $ dropbox.py status

If the status command says Dropbox is now downloading or syncing a ton of files, you might want to remove some directories from being sync'd to this linux machine by using Dropbox's Selective Sync.

Selective Sync allows you to exclude certain folders that you don't want to sync to your Linux server, e.g., maybe you don't need all your photos on it:

 $ cd ~/Dropbox
 $ ls
 Photos Projects Public iphone-photos

Let's exclude the photos and Public directories:

 $ dropbox.py exclude add Photos
 $ dropbox.py exclude add iphone-photos
 $ dropbox.py exclude add Public
 $ ls
 Projects

At this point, I would wait until dropbox.py status says "Up to date" before editing files in your Dropbox folder. Just to make sure nothing gets corrupted. Once Dropbox has finished downloading everything, you should try to sync files to AND from Dropbox. Make sure to test both directions and also test nested directories and your most important directories. I had an issue once where files in ~/Dropbox/Projects/notes/ wouldn't sync outgoing, yet ~/Dropbox/Projects/code/ files would. If it doesn't, you'll need to reinstall Dropbox (instructions are below). It shouldn't be a firewall issue because the docs specify that Dropbox runs on port 80.

Alright, if you made this far and everything seems working, the great!

You may also want to delete any cached deleted files from the exclude step (NOTE: don't even try this unless you need it and at least until you're sure dropbox is already working):

 $ sudo service dropbox stop
 $ rm -rf ~/Dropbox/.dropbox.cache/*
 $ sudo service dropbox start

Don't delete the .dropbox.cache dir itself, just its contents.

If you ever need to delete the dropbox instance:

^C to stop the dropboxd daemon
rm -rf ~/.dropbox* ~/Dropbox
Revoke the token that was issued for your session by clicking 'unlink' next to your Droplet's host name at https://www.dropbox.com/account#security

It's important that you actually delete the ~/.drobox* folders, because your old key is stored there, and you have to generate a new one next time you run the original wget commands. So don't think of keeping it around as a shortcut to reinstalling.

Sources:
https://www.dropbox.com/install?os=linux
https://www.digitalocean.com/community/questions/dropbox-works-with-digitalocean-droplets

Sunday, March 2, 2014

The Path to Homebrew

The home page and documentation for Homebrew show how to install and use Homebrew. However, they currently don't seem to explain exactly how pathing works. This can trip up a lot of newcomers, who might give up on Homebrew or fumble around the internet and land on bad advice - such as using sudo and editing /etc/paths. All of this is unnecessary and potentially dangerous.

You just really need to understand a few basic concepts:

Never run brew as sudo. Not "sudo brew install" nor "sudo brew link".
The "Cellar" is a place that all your "kegs" go. Homebrew installs packages to their own directory (in the Cellar) and then symlinks their files into /usr/local/.
Change /usr/local/* to be owned by $USER, not root, so you can have write permissions and not need sudo.
The $PATH entry for /usr/local/bin should occur before /usr/bin.

Here's an example of installing Python and setting paths correctly. I'm using OS X 10.9.2 (Mavericks) and Homebrew 0.9.5:

    
 $ sudo chown -R $USER /usr/local/*
 $ brew doctor
 $ brew update
 $ brew install python --with-brewed-openssl
 $ ls /usr/local/Cellar/python
 2.7.6
 $ python
 >>> 2.7.5

Wait, I expected to see python 2.7.6 now. What happened?

 $ which python
 /usr/bin/python

But the Homebrew docs said we will be using a symlink from /usr/local/bin/ that points at the Cellar instead of using /usr/bin:

 $ ls -l /usr/local/bin/python
 lrwxr-xr-x 1 rkulla admin 33 Mar 2 06:37 /usr/local/bin/python ->
 ../Cellar/python/2.7.6/bin/python

Aha. So it must be a PATH issue:

 $ echo PATH
 /usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin

 $ cat /etc/paths
 /usr/bin
 /bin
 /usr/sbin
 /sbin
 /usr/local/bin

When using Homebrew, we actually now want to change the position of /usr/local/bin to be before /usr/bin in $PATH. But don't edit /etc/paths. Instead edit ~/.bashrc and prepend /usr/local/bin to $PATH, like:

 PATH=/usr/local/bin:$PATH

Next, run:

 $ source ~/.bashrc
 $ echo $PATH
 $ /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin

Don't worry that there's duplicate entries for /usr/local/bin. It doesn't matter. What matters is that you've modified your PATH safely and ultimately cleaner than the other ways.

 $ ls -l $(which python)
 lrwxr-xr-x 1 rkulla admin 33 Mar 2 06:37 /usr/local/bin/python ->
 ../Cellar/python/2.7.6/bin/python

Yay! And if you ever want to use the old python you can just run: /usr/bin/python.

Best of all from now on whenever you need to install anything with brew, you can just run:

 $ brew install whatever

and not have to fumble with permissions or worry about overwriting any global system files.

I also like to source ~/.bashrc from ~/.bash_profile. For why see: .bash_profile vs .bashrc

Ryan Kulla