Thursday, September 18, 2014

TL;DW for Clojure Data Science

Edmund Jackson spoke at Clojure/Conj 2012, and you can see his talk here.

I took these notes as I watched it:
  1. What is "data science"?
    1. "That realm of endeavor that requires, simultaneously, advanced computational and statistical methods."
    2. Some people aren't sure whether "data science" is a thing, or just data analysis dressed up with a fancy name. That question amuses me.
  2. What's new, such that everybody suddenly cares about data science?
    1. Widely available computing resources, open source tools such as R, and large amounts of data available both in private companies and in public
    2. He compares this to the early days of Linux, when there was a bunch of new stuff that everybody could hack on
  3. Interactive tools aren't enough; you're not just taking some data, analyzing it, and coming back with the answer. To embed your analysis in business processes, you need platform features like native language speed, data structures, language constructs, connectivity, and QC.
  4. The tools with better analysis features (e.g., R, Mathematica) lack the platform features, and the tools with better platform features (he focuses primarily on C++ as his example here) lack the analysis features.
  5. Python is in the sweet spot, with platform features and (via numpy, scipy, and pandas) analysis features. But:
    1. It's full of mutable data!
    2. The mode of expression in imperative languages poorly matches the content of expression when you're dealing with maths.
  6. F#, Scala, and Clojure are all functional, and therefore (immutable data, more natural expression of maths) better alternatives than Python.
  7. Clojure yay! points:
    1. Native: Incanter, Storm, Cascalog, Datomic
    2. JVM: Mahout (ML on Hadoop), jBLAS, Weka (Java lib with many ML algorithms)
    3. Interop: Rincanter (call out to R), JNI
  8. From here he goes into calculating the entropy of a distribution, and the relative entropy of different distributions.
  9. Demonstrates using relative entropy fns in Datomic queries
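
For reference, here are the standard definitions of the two quantities in the last items. This is an assumption: the notes don't record the exact form used in the talk. For discrete distributions p and q over the same outcomes, entropy and relative entropy (Kullback-Leibler divergence) are usually written as:

```latex
H(p) = -\sum_{i} p_i \log p_i
\qquad
D(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i}
```

Relative entropy is zero when the two distributions are identical, and it is not symmetric in p and q.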

Wednesday, September 3, 2014

TL;DW for "How To Design A Good API and Why it Matters"

Josh Bloch's Google Tech Talk video "How To Design A Good API and Why it Matters" is about an hour long, and well worth your time. It's focused on OOP, but has lots of good principles that can be followed elsewhere.

In case you don't have an hour right now, here's a summary/index kind of thing that points out the bits I thought were most important.
  1. 6:27: Characteristics of a good API:
    1. Easy to learn
    2. Easy to use, even without documentation
    3. Hard to misuse
    4. Easy to read and maintain code that uses it
    5. Sufficiently powerful to satisfy requirements
    6. Easy to evolve
    7. Appropriate to audience
  2. 7:52: Gather requirements, but differentiate between true requirements (which should take the form of use cases) and proposed solutions.
  3. 10:02: Start with a short spec; one page is ideal.
    1. Agility trumps completeness at this point.
    2. Get as many spec reviews from as many audiences as possible, and modify the spec according to their feedback.
    3. Flesh the spec out as you gain confidence.
  4. 15:10: Write to your API early and often
    1. Start writing to your API before you've implemented it, or even specified it properly.
    2. Continue writing to your API as you flesh it out.
    3. Your code will live on in examples and unit tests.
  5. 17:32: Write to SPI [Service Provider Interface]
    1. Write at least three plugins before your release.
    2. Application in Clojure-land: Not sure...
  6. 19:35: Maintain realistic expectations.
    1. You won't please everyone.
    2. Aim to displease everyone equally.
    3. Expect to make mistakes and evolve the API in the future.
  7. 22:01: API should do one thing and do it well.
    1. Functionality should be easy to explain.
    2. If it's hard to name, that's a bad sign.
      1. Example of bad name that I can't leave out of this summary: OMGVMCID
  8. 24:32: API should be as small as possible but no smaller
    1. "When in doubt, leave it out." You can always add stuff, but you can't ever remove anything you've included. (The speaker calls this out as his most important point.)
  9. 26:27: Implementation should not impact API.
    1. Do not over-specify. For example, nobody needs to know how your hash function works, unless the hashes are persistent.
    2. Don't leak implementation details such as SQL exceptions!
  10. 29:36: Minimize accessibility of everything.
    1. Don't let API callers see stuff you don't want to be public, and that includes anything you might want to change in the future.
  11. 30:39: Names matter: API is a little language.
    1. Make names self-explanatory.
    2. Be consistent.
    3. Strive for symmetry. (If you can GET a monkey-uncle, make sure you can PUT a monkey-uncle, too.)
  12. 32:32: Documentation matters.
    1. Document parameter units! ("Length of banana in centimeters")
  13. 35:41: Consider performance consequences of API design decisions.
    1. Bad decisions can limit performance -- and this is permanent.
    2. Do not warp your API to gain performance -- the slow thing you avoided can be fixed and get faster, but your warped API will be permanent.
    3. Good design usually coincides with good performance.
  14. 40:00: Minimize mutability
    1. Make everything immutable unless there's a reason to do otherwise.
  15. 45:31: Don't make the caller do anything your code should do.
    1. If there are common use cases that require stringing a bunch of your stuff together in a boilerplate way, that's a bad sign.
  16. 48:36: Don't violate the principle of least astonishment
    1. Make sure your API callers are never surprised by what the API does.
  17. 50:03: Report errors as soon as possible after they occur.
  18. 52:00: Provide programmatic access to all data that is available in string form.
    1. Rich Hickey makes a similar point here.
  19. 56:15: Use consistent parameter ordering across methods.
    1. Here's a bad example:
      1. char *strncpy (char *dst, const char *src, size_t n);
      2. void bcopy (const void *src, void *dst, size_t n);
  20. 57:15: Avoid long parameter lists.
  21. 58:21: Avoid return values that demand exceptional processing.
    1. Example: return an empty list instead of nil/null.

Friday, February 14, 2014

hostnames as commands

Several years ago, I adopted a practice I've realized I should write down. I have two shell scripts that live in ~/bin/:
james.mojo.home ~ $ cat bin/ssh-host
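#!/bin/bash
# Invoked through a symlink named after a host; the link name picks the target.

start=$(date)
remote_host=$(basename "$0")
# "$@" forwards the arguments untouched; ssh joins them into one remote command line.
if ! ssh "$remote_host" "$@"; then
    # ssh exited non-zero (a dropped connection, or a failing remote command):
    # show when the session started and ended.
    echo "from $start to"
    date
fi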
james.mojo.home ~ $ cat bin/mosh-host
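#!/bin/bash
# Invoked through a symlink named after a host; the link name picks the target.

start=$(date)
remote_host=$(basename "$0")
if ! mosh "$remote_host" -- $*; then
    # mosh exited non-zero: show when the session started and ended.
    echo "from $start to"
    date
fi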
And I have many symlinks in ~/bin/ that point to those scripts. For example:
lrwxr-xr-x 1 moquist staff 8 Jul 12 2013 aristotle -> ssh-host
lrwxr-xr-x 1 moquist staff 8 Jul 12 2013 bhs.somedomain.com -> ssh-host 
lrwxr-xr-x 1 moquist staff 8 Jul 12 2013 devserver.somedomain.com -> mosh-host 
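
Links like these can be created by hand with `ln -s`. As a minimal sketch, the function below creates one link per host name. The name `link_hosts` and the `BIN_DIR` override are hypothetical, introduced only for illustration; they are not part of the original setup.

```shell
#!/bin/bash
# Sketch: create one symlink per host name, each pointing at a dispatcher
# script such as ssh-host or mosh-host.
# Assumption: BIN_DIR is a hypothetical override; it defaults to ~/bin.

link_hosts() {
    # usage: link_hosts <dispatcher> <host>...
    local dispatcher=$1
    shift
    local dir=${BIN_DIR:-"$HOME/bin"}
    local host
    for host in "$@"; do
        # Relative target, so the link resolves to the script beside it.
        ln -sf "$dispatcher" "$dir/$host"
    done
}
```

For example, `link_hosts ssh-host aristotle bhs.somedomain.com` followed by `link_hosts mosh-host devserver.somedomain.com` would recreate the three links listed above.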

Of course I also have ~/.ssh/config set up, and my SSH keys are all in the appropriate ~/.ssh/authorized_keys files on remote systems.
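
As a rough sketch of what such a setup might contain, an entry like the following lets a short name such as `aristotle` resolve to a full hostname. The hostname, user, and key path are hypothetical placeholders, not the actual settings behind this post.

```
# ~/.ssh/config (hypothetical values, for illustration only)
Host aristotle
    HostName aristotle.example.com
    User someuser
    IdentityFile ~/.ssh/id_ed25519
```

With an entry like this, `ssh aristotle` picks up the hostname, user, and key on its own, so a symlink named `aristotle` needs no further arguments.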

But once all that's done, if I want to log in to a system, I can just type the name of the system (with tab completion). If I want to pipe something into or out of a command on a remote system (via ssh-host only), the system name just becomes another command:
james.mojo.home ~ $ aristotle "w | grep eviluser || echo eviluser is absent"
eviluser is absent
james.mojo.home ~ $ aristotle cat somefile | grep bits-i-want
### elided ###
james.mojo.home ~ $ for h in aristotle plato plantinga kant; do echo ====$h====; $h ls | grep lostfile; done
Obviously these are contrived examples, and there are plenty of other ways to do the same things. I've just found it convenient to think of hosts as commands, and this approach has let me do that.
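
The mechanism behind all of this is that `$0` holds the name a script was invoked as, so one script can behave differently under each symlink. A minimal sketch that can be run without any remote hosts follows. The `show-host` script is a hypothetical stand-in that reports what it would do instead of calling ssh, and it lives in a throwaway temp directory.

```shell
#!/bin/bash
# Sketch: show how the symlink name reaches the script through $0.
demo_dir=$(mktemp -d)

# Stand-in for ssh-host (hypothetical): print the host instead of connecting.
cat > "$demo_dir/show-host" <<'EOF'
#!/bin/bash
echo "would connect to $(basename "$0") and run: $*"
EOF
chmod +x "$demo_dir/show-host"

# Two "hosts" that are really the same script under different names.
ln -s show-host "$demo_dir/aristotle"
ln -s show-host "$demo_dir/plato"

"$demo_dir/aristotle" uptime
# prints: would connect to aristotle and run: uptime
```

Putting the directory that holds the links on `PATH` is what turns each host name into a bare command.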