Getting Started with dplyr

A ridiculously useful R package for wrangling data:

Part 1 http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/

Part 2 http://www.dataschool.io/dplyr-tutorial-part-2/

R Markdown

A great primer video on using R Markdown to format a nice-looking report as HTML, PDF, Word, or a slideshow. It covers how to format your report, how to use parameters to easily re-run the report on different variables, and how to create templates that load certain styles, images, and fonts so every new report has the same look. It also introduces htmlwidgets for interactive visualizations, and shows how to turn your report into a Shiny interactive document.

RStudio + GitHub

  • Git allows you to fail and easily roll back to previous states
  • It’s like saving the same file with different filenames “data.doc” “dataFINAL.doc” “datafinalfinalfinal.doc”
  • Think about your project as a series of changes.
  • Each iteration should have:
    • unique identifier
    • what changed?
    • when did it change?
    • who changed it?
    • why did it change?
  • In git, each change is called a commit and has a unique identifier called a SHA (a hash of the commit)
  • Tags can be added to name particularly important commits; git also maintains some names automatically, like HEAD, which points to the commit you currently have checked out
  • A diff is what changed between commits
  • Git is great at undoing things, including undoing an undo!
  • you’re always working with at least one other person on a project: Your Future Self
  • git bisect: testing different commits to find where in the history of commits the problem first occurred
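
To make the notes above a bit more concrete, here is a minimal sketch of the commit / tag / diff / undo cycle on the command line. Everything specific in it is made up for illustration: the throwaway repo, the file name data.txt, the tag name first-draft, and the "Future Self" identity are all assumptions, not anything from the video. It only uses standard git subcommands (init, config, add, commit, tag, diff, revert, log).

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real project is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical identity, set for this demo repo only.
git config user.name "Future Self"
git config user.email "future.self@example.com"

# First commit. Git records who, when, and the SHA automatically;
# the commit message is where the "why" goes.
echo "draft" > data.txt
git add data.txt
git commit -q -m "Add first draft of data notes"

# Second commit: a change we will later regret.
echo "oops" >> data.txt
git commit -q -am "Append a line that turns out to be wrong"

# Name the commit that matters (the first one) with a tag.
git tag first-draft HEAD~1

# What changed between the two commits?
git diff HEAD~1 HEAD

# Undo the second commit by adding a new commit that reverses it.
git revert --no-edit HEAD
after_undo=$(cat data.txt)

# Undo the undo: revert the revert, which brings the line back.
git revert --no-edit HEAD
after_redo=$(cat data.txt)

# The whole story stays in the history, one line per commit.
git log --oneline
```

Using git revert here, rather than rewriting history, means every step (including the mistakes) stays visible in the log, which is probably what Future Self wants.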

An incredible resource on connecting RStudio and Git: http://happygitwithr.com/credential-caching.html 

 

Capstone Ideas for a Data Science Course

I work for an arts organization. We have a robust database (with over 15 years of data) on visitors, donors, members, and other community contacts that includes data on visitation, purchases, memberships and other donations, survey responses, addresses, and email responses. I have been working with this data set for over 3 years (and diving in deep as the Data Analyst for a year), and have several current areas of interest for projects:

 

Sentiment analysis on survey responses

We send semi-regular surveys to our visitors. In particular, we have 3 large surveys I’d like to look at. These are annual surveys we’ve sent to our email list each summer that ask a number of questions about brand perception and overall satisfaction with our organization. These surveys all contain a lot of qualitative data I’d like to parse to find what matters most to our audience and if/how that has changed over the last 3 years.

 

Email Response Analysis:

We have 3 “audience segments” (defined for us by an outside consultant) we’d like to learn more about. People qualify for one of the segments based on their answers to psychographic questions. We have many people tagged with their audience segment, and I’d like to look at each segment’s email behavior (what they have opened, clicked through on, and unsubscribed from) to find & define the best way to communicate with each of them. That would help our Email Strategist craft emails specifically targeting each segment.

 

Member Analysis:

Pretty standard problem for any membership/subscription service. We have a membership program and our CRM has data on the timing, amount spent, and length of these memberships. I’d like to do some work to find & define either one, two, or all three of these segments (depending on time):

  • nonmembers (ticketbuyers) in database most likely to become members
  • members most likely to renew
  • members/donors most likely to upgrade

 

Why & How Data Science?

Data scientists make new discoveries. They form a hypothesis and investigate it, looking for meaning and knowledge in the data.

Data/Business Analyst: Visualize the data, create reports & look for patterns

Data Scientists: Use advanced algorithms to run through the data looking for meaning. This is what distinguishes them from data analysts.

Strong foundational knowledge of math, statistics, computer science.

Use a dataset + algorithms to answer a question. Which customers are likely to churn? How can I improve the recommendations we give our customers for products? Data scientists have hypotheses of what is important in the data and they test it. They need both technical skills/domain knowledge and instinct/insight of business needs.

  • Understand the data
  • 70-80% of a data scientist’s time is spent assembling data (SQL statements, text mining). You eventually want to hand this off to data engineers/data integration specialists
  • Discovery process – running algorithms, finding new knowledge. You eventually want to spend all your time on this.

Further reading: https://www.safaribooksonline.com/library/view/doing-data-science/9781449363871/ch01.html

  • This article talks about what a good data team should look like — a collection of people with varying levels of skillsets in Data Viz, Machine Learning, Math, Stats, Computer Science, Communication, and Domain Expertise.
  • “A chief data scientist should be setting the data strategy of the company, which involves a variety of things: setting everything up from the engineering and infrastructure for collecting data and logging, to privacy concerns, to deciding what data will be user-facing, how data is going to be used to make decisions, and how it’s going to be built back into the product. She should manage a team of engineers, scientists, and analysts and should communicate with leadership across the company, including the CEO, CTO, and product leadership. She’ll also be concerned with patenting innovative solutions and setting research goals.”
  • “More generally, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and munging data, because data is never clean. This process requires persistence, statistics, and software engineering skills—skills that are also necessary for understanding biases in the data, and for debugging logging output from code.”
  • “Once she gets the data into shape, a crucial part is exploratory data analysis, which combines visualization and data sense. She’ll find patterns, build models, and algorithms—some with the intention of understanding product usage and the overall health of the product, and others to serve as prototypes that ultimately get baked back into the product. She may design experiments, and she is a critical part of data-driven decision making. She’ll communicate with team members, engineers, and leadership in clear language and with data visualizations so that even if her colleagues are not immersed in the data themselves, they will understand the implications.”

Yikes.

Want to help arts organizations support queer, POC, and women artists? Attend their shows. Donate a dollar with a note saying you did so to support an artist being shown there. Click on emails about their work. Engage with online content about their work (like, comment, repost, etc.). It’s crazy to me that one of the top factors in our most successful exhibitions (when success is defined by attendance, revenue, or outreach engagement) is whether or not the title of the exhibition has a male name in it, no matter how well-known or obscure those artists are. We’re not going to stop showcasing under-represented artists, but it’s so disheartening to find numbers like this lurking in our data. [chart showing male exhibitions performing better].