Published: Apr 19, 2017 License: Apache-2.0


Ultimate Data

This is material for developers who have some experience with Go and statistics and want to learn how to work with data to make better decisions. We believe these classes are perfect for data analysts/scientists/engineers interested in working in Go or Go programmers interested in doing data analysis.


Design Guidelines

You must develop a design philosophy that establishes a set of guidelines. This is more important than developing a set of rules or patterns you apply blindly. Guidelines help to formulate, drive and validate decisions. You can't begin to make the best decisions without understanding the impact of your decisions. Every decision you make, every line of code you write comes with trade-offs.

What is data analysis?

Data analysis uses Datasets to make Decisions that have corresponding Actions and Consequences.

Prepare your mind

Every data analytics or data science project must begin by considering the:

  1. Decisions
  2. Actions
  3. Consequences

Before and during any data analytics project, you must be able to answer the following questions:

  • What decisions do I want to make based on the results?
  • What actions are triggered by the decisions that will be made?
  • What are the consequences of those actions?
  • What do the results need to contain?
  • What is the data required to produce a valid result?
  • How will I measure whether the results are valid?
  • Can the results be effectively conveyed to decision makers?
  • Am I confident in the results?

Remember, uncertainty is not a license to guess but a directive to stop.

Order of Operations

Data analytics projects should follow these steps in this order:

  1. Understand the decisions, actions and consequences involved.
  2. Understand the relevant data to be gathered and analyzed.
  3. Gather and organize the relevant data.
  4. Understand the readability and expectations for determining valid results.
  5. Determine the most interpretable process to produce the valid results.
  6. Determine how you will test the validity of the results.
  7. Develop the determined process and tests.
  8. Test the results and evaluate against your expectations.
  9. Refactor as necessary.
  10. Look for ways to simplify, minimize and reduce.

When the results don’t meet the expectations, ask yourself whether modifying the determined process or the data would improve the validity of the results.

  • If YES, you must re-evaluate:
    • Are such modifications warranted?
    • Can the modification be tested against the expectations?
    • Do I need to increase complexity?
    • Have I tested the most simplistic and interpretable solutions first?
  • If NO, you must re-evaluate:
    • Am I using the best determined process?
    • Am I using the best data?
    • Are my expectations incorrect?
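Steps 6 through 8 above can be sketched as a pre-declared validity check that results must pass before any decision is made. This is a minimal Go sketch, not a prescribed implementation; the reference value, tolerance and result values are hypothetical:

```go
package main

import "fmt"

// validate is step 6 made concrete: a test of validity declared before
// the results exist. The expectation here is that the mean of the
// results falls within a tolerance of a known reference value.
func validate(results []float64, want, tol float64) bool {
	var sum float64
	for _, r := range results {
		sum += r
	}
	mean := sum / float64(len(results))
	return mean >= want-tol && mean <= want+tol
}

func main() {
	// Step 7: output of the determined process (made-up values).
	results := []float64{9.8, 10.1, 10.0}

	// Step 8: test the results against the expectation before acting.
	if validate(results, 10.0, 0.5) {
		fmt.Println("results meet expectations")
	} else {
		fmt.Println("re-evaluate process or data")
	}
}
```

Declaring the check before producing results keeps the YES/NO re-evaluation above honest: you cannot quietly loosen the expectation to fit the output.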

Guidelines, Decision Making and Trade-Offs

Develop your design philosophy around these major categories in this order: Integrity, Value, Readability/Interpretability, and Performance. You must be able to consciously explain, with good reason, the category you are prioritizing.

Note: There are exceptions to everything, but when you are not sure an exception applies, follow the guidelines presented as best you can.

1) Integrity - If data science uses Datasets to make Decisions, a breakdown in integrity results in bad decisions. These decisions impact people, and therefore, making bad decisions may cause irreparable damage to real people. Nothing trumps integrity - EVER.

Rules of Integrity:

  • Error handling code is the main code.
  • You must understand the data.
  • Control the input and output of your processes.
  • You must be able to reproduce results.
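One way to read the first and last rules together: a record that fails to parse must surface as an error, never be silently dropped, or the same input will not reproduce the same result. A minimal Go sketch, with made-up input values:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseValue treats error handling as the main code: a field that cannot
// be parsed returns an error with context instead of being skipped, so
// the process either consumes all of its input or stops visibly.
func parseValue(field string) (float64, error) {
	v, err := strconv.ParseFloat(field, 64)
	if err != nil {
		return 0, fmt.Errorf("parsing %q: %w", field, err)
	}
	return v, nil
}

func main() {
	fields := []string{"1.5", "2.0", "not-a-number"}
	for _, f := range fields {
		v, err := parseValue(f)
		if err != nil {
			// Controlled output: the failure is reported, not hidden.
			fmt.Println("stopping:", err)
			return
		}
		fmt.Println("value:", v)
	}
}
```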

2) Value - Effort without actionable results is not valuable. Just because you can produce a result, does not mean the result contains value.

Rules of Value:

  • If an action cannot be taken based on a result, the result does not have value.
  • If the impact of a result cannot be measured, the result does not have value.

3) Readability and Interpretability - This is about writing simple analyses that are easy to read and understand without mental exhaustion. However, this is also about avoiding unnecessary data transformations and analysis complexity that hides:

  • The cost/impact of individual steps of the analyses.
  • The underlying purpose of the data transformations and analyses.

4) Performance - This is about making your analyses run as fast as possible and produce results that minimize a given measure of error. When code is written with this as the priority, it is very difficult to write code that is readable, simple or idiomatic. If increasing the accuracy of a given result by, e.g., 0.001% requires a significant increase in effort or complexity and doesn’t produce more value or different actions, the optimization/efficiency effort is not warranted.

Directories

caching
classification_kNN
classification_trees
csv_cleaning
csv_io
data_versioning
dimensionality_reduction
evaluation
hypothesis_testing
integrity
json
matrices
matrix_operations
regression
sql
stats_measures
stats_visualization
