A Great Debate — What Language For Data Science
- July 6, 2016
[Editor’s note: This is a guest post from Nolan Grace, a software developer and consultant at BP3 who has a passion for data science. This post is shared with permission.]
Programming languages are an extremely interesting topic in the computer science industry. New languages are created all the time, and there are developers who adore change and those who despise it. A personal mantra of mine is, “Use the right tool for the job.” No matter what I am working on, I have a problem to solve and I am going to use the right tools to architect the most elegant solution. That brings us to the topic of this blog: what is the right programming language for data science? For those of you who aren’t familiar with it, data science is the applied science of collecting, storing, and analyzing data in order to build a better understanding of the information. Sorry to burst your bubble, but data science is too broad to come up with one perfect programming language. Like many things, programming languages have pros and cons, and they can work well in some situations and poorly in others. Hopefully I can detail the main features and flaws of the top three programming languages in the ecosystem, and you can use this information to make the right choice for the problem you’re trying to solve.
Python is an extremely interesting language and it continues to surprise me. My first experience with Python was in an intro-level programming class I took at Purdue. At the time I was already familiar with Java and C, so an introduction to programming in Python was quite the transition. Python does not require you to declare types, use semicolons, or even compile the code. My initial thoughts about Python were not positive. After working in C, Python felt like switching from a nicely sharpened pencil to finger paints. However, after college I started working with data, and I realized not everything has to be precise. Sometimes you just need to write something quickly and see how it works.
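To make the "no types, no semicolons, no compile step" point concrete, here is a minimal sketch. The function name `describe` and the sample values are made up for illustration; the snippet runs as-is when saved to a file and passed to the Python interpreter.

```python
# No type declarations, no semicolons, and no separate compile step.
# `describe` is a hypothetical helper used only for this illustration.
def describe(value):
    """Return a short description of whatever value is passed in."""
    return f"{value!r} is a {type(value).__name__}"


reading = 42                 # the name starts out bound to an int
first = describe(reading)

reading = "forty-two"        # the same name can later be bound to a str
second = describe(reading)

print(first)
print(second)
```

The same variable holds an integer and then a string without any declaration, which is exactly the looseness that feels like finger paints after C.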
By leveraging programming notebooks like Jupyter and Zeppelin you can get in, crunch some data, and chart your findings all in one language. A programming notebook is typically a web UI where you can easily write and test code. These environments work very nicely for the fast-paced, trial-and-error nature of data science research. Python is an easy-to-use language that has libraries out the wazoo. If you are looking to get started in Python for data science, I would recommend taking a look at the Anaconda distribution, which is a huge package of Python libraries; Anaconda contains the Jupyter notebook along with statistical and charting libraries, all in one place.
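The kind of quick exploration a notebook cell encourages might look something like the sketch below. It uses only the standard library so it runs anywhere; the `orders` data is invented for illustration, and in practice you would more likely reach for the statistical and charting libraries that ship with Anaconda.

```python
import statistics
from collections import Counter

# Hypothetical sample data: daily order counts, made up for this example.
orders = [12, 15, 11, 19, 15, 22, 15, 18]

# "Crunch some data": a few summary statistics in one step.
summary = {
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "stdev": round(statistics.stdev(orders), 2),
}
print(summary)

# A crude text "chart": one row per distinct value, one star per occurrence.
for value, count in sorted(Counter(orders).items()):
    print(f"{value:>3} {'*' * count}")
```

In a notebook you would tweak the data or the statistic, rerun the cell, and see the result immediately, which is the trial-and-error loop described above.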
Python, however, is not all magic and rainbows. In particular, Python tends to be significantly less performant than other languages. Because it is interpreted rather than compiled ahead of time and allocates memory on the fly, Python has little opportunity to build an optimized execution plan. This makes Python work great as a scripting language, but it tends to fall short when used to build and deploy production applications.
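One rough way to see this overhead for yourself is to time an explicit Python loop against the built-in `sum`. This is only a sketch: it assumes CPython, the standard interpreter, where `sum` runs its loop in compiled C code. The function name `loop_sum` and the input size are arbitrary choices, and the timings will vary from machine to machine.

```python
import timeit

values = list(range(100_000))


def loop_sum(numbers):
    """Add the numbers one at a time in interpreted Python bytecode."""
    total = 0
    for n in numbers:
        total += n
    return total


# Both approaches compute the same result...
assert loop_sum(values) == sum(values)

# ...but on CPython (an assumption of this sketch) the built-in does its
# looping in compiled C code instead of interpreted bytecode.
loop_time = timeit.timeit(lambda: loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)
print(f"explicit loop: {loop_time:.4f}s, built-in sum: {builtin_time:.4f}s")
```

This is also why many of Python's data libraries push their heavy lifting into compiled code: the interpreted loop is where the time goes.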
I view Python as a nice, easy to use, research tool. Depending on the amount of data you are working with, you may need to consider another language if you aren’t interested in waiting forever for all your data to get crunched.
- Moderate Performance
- Easy to Use
- Huge Community
- Tons of Libraries
As someone who started programming in Java, I have a fondness for it. Sadly, I wouldn’t consider Java to be the most appropriate programming language for data science because of the bulky code that its strict, object-oriented style requires.
Scala is a programming language that runs on the JVM alongside Java and is compiled and executed the same way. Anything written in Java can be easily called from Scala, so you can take advantage of all the existing Java libraries. The big difference between Java and Scala is that Scala was designed to be a much more functional programming language. While Java is moving in the direction of Scala with the introduction of lambdas in Java 8, the developers of Scala decided not to wait.
Scala can also run in programming notebooks, much like Python, but so far I have been unable to find a good notebook that can run Java. Like Python, Scala is also very simple to write. In my opinion it can be much more expressive, allowing for higher-level programming. Because Scala compiles and runs on the JVM, it also has significant performance gains over Python, which makes it a great language for writing and deploying production-quality code. For example, Apache Spark and a lot of new and important data science frameworks are built in Scala and use Scala as their default implementation language.
Scala is a fairly new programming tool, so its community and documentation aren’t as strong as Python’s. When it comes to visualization libraries Scala has some options, but nowhere near the quality and quantity of Python’s. In summary, Scala is much more performant than Python but not quite as easy to use, and it has far fewer existing research libraries…but let’s not forget Scala can leverage everything written in Java.
- Highly stable and performant
- Moderate community
- Moderately easy to use
R is the most unusual contender in the field. R is a programming language developed by researchers for researchers. That makes it very different from any programming language I have ever worked with, and it can be very hard to wrap your head around if you are used to other programming languages. What R lacks in familiarity it makes up for in raw statistical ability. R is the powerhouse in building and customizing statistical operations. If you are just looking to pull in some data and really have some statistical power over it, I would recommend R. R has a very big community of researchers and a large existing base of libraries you can use.
R also tends to be the least performant of all the languages I have mentioned today, but it can be leveraged in production if you know what you’re doing. Stay tuned for my blog on Apache SystemML for more information on scaling R.
- Difficult to use well
- Poor performance at scale
- Specialized Community
- Powerful Statistical Operations
How much data do you have? What kind of statistics do you need? How much external help are you going to need? These are the questions you need to ask yourself when you are trying to figure out the best tool for your job. Use this information as a stepping stone and try something out in each language. Experimentation really is the best way to learn in the programming world. If I had to choose one language I would pick Scala, but what works for me may not work for you.
 Python For Data Science by Jake VanderPlas
 Python in a Nutshell, 3rd Edition by Steve Holden, Anna Ravenscroft, Alex Martelli
 High Performance Python by Ian Ozsvald, Micha Gorelick
 Programming in Scala, Third Edition by Bill Venners, Lex Spoon, Martin Odersky
 Parallel R by Stephen Weston, Q. Ethan McCallum