I took these notes as I watched it:
- What is "data science"?
- "That realm of endeavor that requires, simultaneously, advanced computational and statistical methods."
- Some people aren't sure whether "data science" is a thing, or just data analysis dressed up with a fancy name. That question amuses me.
- What's new, such that everybody suddenly cares about data science?
- Widely available computing resources, open source tools such as R, and large amounts of data available both inside private companies and publicly
- He compares this to the early days of Linux, when there was a bunch of new stuff that everybody could hack on
- Interactive tools aren't enough; you're not taking some data, analyzing it, and coming back with the answer. You need platform features like native language speed, data structures, language constructs, connectivity, and QC in order to embed your analysis in business processes.
- The tools with better analysis features (e.g., R, Mathematica) lack the platform features, and the tools with better platform features (he focuses primarily on C++ as his example here) lack the analysis features.
- Python is in the sweet spot, with platform features and (via numpy, scipy, and pandas) analysis features. But:
  - It's full of mutable data!
  - The mode of expression in imperative languages poorly matches the content of expression when you're dealing with maths.
- F#, Scala, and Clojure are all functional, so they offer immutable data and a more natural expression of maths, which in his view makes them better alternatives than Python.
- Clojure yay! points:
  - Native: Incanter, Storm, Cascalog, Datomic
  - JVM: Mahout (ML on Hadoop), jBLAS, Weka (Java lib with many ML algorithms)
  - Interop: Rincanter (call out to R), JNI
- From here he goes into calculating the entropy of a distribution, and the relative entropy of different distributions.
- Demonstrates using relative entropy fns in Datomic queries
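
To make the entropy part concrete for myself, here is a minimal sketch of the two quantities. It's my own illustration in Python, not the Clojure code from the talk, and it assumes the standard Shannon definitions in bits. The names `entropy`, `relative_entropy`, `fair`, and `biased` are made up for this example.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution.

    `p` is a sequence of probabilities that sum to 1.
    Zero-probability outcomes contribute nothing to the sum.
    """
    return -sum(x * math.log2(x) for x in p if x > 0)

def relative_entropy(p, q):
    """Relative entropy (KL divergence) D(p || q), in bits.

    `p` and `q` list probabilities for the same outcomes, in the same order.
    If q assigns zero probability to an outcome that p allows,
    the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # outcomes p never produces contribute nothing
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

# Tuples rather than lists, in keeping with the talk's preference
# for immutable data.
fair = (0.5, 0.5)
biased = (0.9, 0.1)

print(entropy(fair))                   # 1.0 (one full bit for a fair coin)
print(entropy(biased))                 # less than 1 bit
print(relative_entropy(biased, fair))  # positive: the distributions differ
print(relative_entropy(fair, fair))    # 0.0 (a distribution vs. itself)
```

Note that relative entropy isn't symmetric: `relative_entropy(biased, fair)` and `relative_entropy(fair, biased)` generally give different values.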