Hypothesis

9 Matching Annotations

Jan 2025
ericmjl.github.io

Untitled document

1
1. structseeker 26 Jan 2025
  
  in Public
  
  Python Data Science Bootstrap
  
  Software Engineering, Python, Handbook, Data Science, Guide

Tags

Python

Data Science

Software Engineering

Guide

Handbook

Annotators

structseeker

URL

ericmjl.github.io/data-science-bootstrap-notes/get-bootstrapped-on-your-data-science-projects/
Nov 2022
scribe.rip

How to architect the perfect Data Warehouse

1
1. ravenscroftj 23 Nov 2022
  
  in Public
  
  One example could be putting all files into an Amazon S3 bucket. It’s versatile, cheap and integrates with many technologies. If you are using Redshift for your data warehouse, it has great integration with that too.
  
  Essentially, the raw data needs to be roughly homogenised and put into a single place.
  
  ELT, data-engineering

Tags

data-engineering

ELT

Annotators

ravenscroftj

URL

scribe.rip/how-to-architect-the-perfect-data-warehouse-b3af2e01342e
rmoff.net

Data Engineering in 2022: ELT tools

2
1. ravenscroftj 20 Nov 2022
  
  in Public
  
  It took me a while to grok where dbt comes in the stack but now that I (think) I have it, it makes a lot of sense. I can also see why, with my background, I had trouble doing so. Just as Apache Kafka isn’t easily explained as simply another database, another message queue, etc, dbt isn’t just another Informatica, another Oracle Data Integrator. It’s not about ETL or ELT - it’s about T alone. With that understood, things slot into place. This isn’t just my take on it either - dbt themselves call it out on their blog:
  
  Also, because their "pricing" page caught me off guard and their website isn't that clear (until you click through to the technical docs), it's worth calling out that dbt appears to be an open-core platform. They have a SaaS offering and also an open-source Python command-line tool; it seems that these articles are about the latter.
  
  ELT, data-engineering
2. ravenscroftj 20 Nov 2022
  
  in Public
  
  Working with the raw data has lots of benefits, since at the point of ingest you don’t know all of the possible uses for the data. If you rationalise that data down to just the set of fields and/or aggregate it up to fit just a specific use case then you lose the fidelity of the data that could be useful elsewhere. This is one of the premises and benefits of a data lake done well.
  
  Absolutely right. There's also a data provenance angle here: it is useful to be able to point to a data point that is 5 or 6 transformations from the raw input and say, "Yes, I know exactly where this came from; here are all the steps that came before."
  
  data-engineering, data-science, ELT
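
The provenance point in the annotation above can be illustrated with a small sketch. It is not taken from the linked article or from dbt; `TrackedDataset`, `fingerprint`, `total_by_country`, and the sample rows are hypothetical names invented for illustration. The idea is simply that the raw rows are never modified, and every derived dataset carries an ordered log of the steps that produced it.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class TrackedDataset:
    """Rows plus an ordered log of the steps that produced them (hypothetical helper)."""
    rows: list
    lineage: list = field(default_factory=list)

    def apply(self, name, fn):
        # Build a new dataset; the input rows are left untouched.
        new_rows = fn(self.rows)
        step = {"step": name, "rows_in": len(self.rows), "rows_out": len(new_rows)}
        return TrackedDataset(new_rows, self.lineage + [step])


def fingerprint(rows):
    """Stable hash of the raw input, so a derived value can be tied to its source."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def total_by_country(rows):
    """Aggregate amounts per country; this is the lossy, use-case-specific step."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["amount"]
    return [{"country": c, "total": t} for c, t in sorted(totals.items())]


# Made-up sample data standing in for a raw ingest.
raw_rows = [
    {"user": "a", "amount": 10, "country": "GB"},
    {"user": "b", "amount": -5, "country": "GB"},
    {"user": "c", "amount": 7, "country": "US"},
]
raw = TrackedDataset(raw_rows, [{"step": "ingest", "sha256": fingerprint(raw_rows)}])

valid = raw.apply(
    "drop_negative_amounts",
    lambda rows: [r for r in rows if r["amount"] >= 0],
)
totals = valid.apply("total_by_country", total_by_country)

# The aggregate can be traced back step by step to the raw input.
for step in totals.lineage:
    print(step)
```

Because each step returns a new dataset, the raw rows remain available for other uses, and the lineage list answers "where did this number come from?" for any derived result.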

Tags

ELT

data-science

data-engineering

Annotators

ravenscroftj

URL

rmoff.net/2022/11/08/data-engineering-in-2022-elt-tools/
Sep 2021
rapidminer.com

You Need an Embedded Data Science Factory, Not a Research Institute - RapidMiner

1
1. mlenc 10 Sep 2021
  
  in Public
  
  data science, data engineering, factory not lab, digital transformation

Tags

factory not lab

digital transformation

data engineering

data science

Annotators

mlenc

URL

rapidminer.com/blog/data-science-factory/
Apr 2021
www.turing.ac.uk

Research Engineering

1
1. mlenc 23 Apr 2021
  
  in Public
  
  research infrastructure, research engineering, information infrastructure, data infrastructure, open science

Tags

research engineering

information infrastructure

research infrastructure

open science

data infrastructure

Annotators

mlenc

URL

turing.ac.uk/work-turing/research/research-engineering
Aug 2020
www.newyorker.com

What Can America Learn from Europe About Regulating Big Tech?

1
1. ErikStuchly 19 Aug 2020
  
  in BehSci
  
  Romeo, N. (n.d.). What Can America Learn from Europe About Regulating Big Tech? The New Yorker. Retrieved August 19, 2020, from https://www.newyorker.com/tech/annals-of-technology/what-can-america-learn-from-europe-about-regulating-big-tech
  
  is:news, lang:en, technology, business, Silicon Valley, democracy, tradition, engineering, money, tech company, regulation, policy, legislation, control, taxation, data use

Tags

technology

tradition

legislation

regulation

Silicon Valley

lang:en

engineering

tech company

data use

money

is:news

policy

democracy

control

taxation

business

Annotators

ErikStuchly

URL

newyorker.com/tech/annals-of-technology/what-can-america-learn-from-europe-about-regulating-big-tech
Nov 2018
multithreaded.stitchfix.com

Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department | Stitch Fix Technology – Multithreaded

1
1. IanMulvany 08 Nov 2018
  
  in Public
  
  Unless you need to push the boundaries of what these technologies are capable of, you probably don’t need a highly specialized team of dedicated engineers to build solutions on top of them. If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … – places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”. Messes tend to necessitate specialization.
  
  data-science, data-enginerring, engineering

Tags

engineering

data-enginerring

data-science

Annotators

IanMulvany

URL

multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/
Jul 2018
engineeringblog.yelp.com

More Than Just a Schema Store

1
1. scottyoliver 16 Jul 2018
  
  in Public
  
  We noticed that the people who use the data are usually not the same people who produce the data, and they often don’t know where to find the information about the data they try to use. Since the Schematizer already has the knowledge about all the schemas in the Data Pipeline, it becomes an excellent candidate to store information about the data. Meet our knowledge explorer, Watson. The Schematizer requires schema registrars to include documentation along with their schemas. The documentation then is extracted and stored in the Schematizer. To make the schema information and data documentation in the Schematizer accessible to all the teams at Yelp, we created Watson, a webapp that users across the company can use to explore this data. Watson is a visual frontend for the Schematizer and retrieves its information through a set of RESTful APIs exposed by the Schematizer.
  
  data-engineering
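
The quoted passage says the Schematizer requires documentation to be registered along with each schema and exposes it for exploration. A minimal sketch of that rule follows. It is not Yelp's code or API: `SchemaRegistry`, `MissingDocumentationError`, the `doc` key, and the example schemas are all assumptions made for illustration.

```python
class MissingDocumentationError(ValueError):
    """Raised when a schema or one of its fields has no documentation."""


class SchemaRegistry:
    """Hypothetical in-memory registry that refuses undocumented schemas."""

    def __init__(self):
        self._schemas = {}

    def register(self, name, schema):
        # Collect every field whose "doc" entry is missing or blank.
        undocumented = [
            f["name"] for f in schema["fields"] if not f.get("doc", "").strip()
        ]
        if not schema.get("doc", "").strip():
            raise MissingDocumentationError(f"schema '{name}' has no description")
        if undocumented:
            raise MissingDocumentationError(
                f"schema '{name}' has undocumented fields: {', '.join(undocumented)}"
            )
        self._schemas[name] = schema

    def describe(self, name):
        """Return only the documentation, as a data consumer would browse it."""
        schema = self._schemas[name]
        return {
            "doc": schema["doc"],
            "fields": {f["name"]: f["doc"] for f in schema["fields"]},
        }


registry = SchemaRegistry()
registry.register(
    "business",
    {
        "doc": "One row per listed business.",
        "fields": [
            {"name": "id", "type": "int", "doc": "Unique business identifier."},
            {"name": "name", "type": "string", "doc": "Display name of the business."},
        ],
    },
)

# A schema with an undocumented field is rejected at registration time.
try:
    registry.register(
        "review",
        {"doc": "One row per review.", "fields": [{"name": "stars", "type": "int"}]},
    )
    rejection = None
except MissingDocumentationError as err:
    rejection = str(err)

print(registry.describe("business"))
print(rejection)
```

Enforcing the check at registration time means the documentation exists before any consumer goes looking for it, which is the property the quoted passage attributes to the Schematizer.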

Tags

data-engineering

Annotators

scottyoliver

URL

engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html
