3,440 Matching Annotations
  1. Nov 2020
    1. What is Data Egress? Managing Data Egress to Prevent Sensitive Data Loss

      [[What is Data Egress? Managing Data Egress to Prevent Sensitive Data Loss]]

    1. Portable... your .name address works with any email or web service. With our automatic forwarding service on third level domains, you can change email accounts, your ISP, or your job without changing your email address. Any mail sent to your .name address arrives in any email box you choose.
    1. In-depth questions: The following interview questions enable the hiring manager to gain a comprehensive understanding of your competencies and assess how you would respond to issues that may arise at work: What are the most important skills for a data engineer to have? What data engineering platforms and software are you familiar with? Which computer languages can you use fluently? Do you tend to focus on pipelines, databases or both? How do you create reliable data pipelines? Tell us about a distributed system you've built. How did you engineer it? Tell us about a time you found a new use case for an existing database. How did your discovery impact the company positively? Do you have any experience with data modeling? What common data engineering maxim do you disagree with? Do you have a data engineering philosophy? What is a data-first mindset? How do you handle conflict with coworkers? Can you give us an example? Can you recall a time when you disagreed with your supervisor? How did you handle it?

      deeper dive into [[Data Engineer]] [[Interview Questions]]

    1. to be listed on Mastodon’s official site, an instance has to agree to follow the Mastodon Server Covenant which lays out commitments to “actively moderat[e] against racism, sexism, homophobia and transphobia”, have daily backups, grant more than one person emergency access, and notify people three months in advance of potential closure. These indirect methods are meant to ensure that most people who encounter a platform have a safe experience, even without the advantages of centralization.

      Some of these baseline protections are certainly a good idea. Advance notice of shutdown and daily backups are particularly valuable.

      I'd not known of the Mastodon Server Covenant before.

    1. The Hierarchy of Analytics: Among the many advocates who pointed out the discrepancy between the grinding aspect of data science and the rosier depictions that media sometimes portrayed, I especially enjoyed Monica Rogati’s call out, in which she warned against companies who are eager to adopt AI: Think of Artificial Intelligence as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure). This framework puts things into perspective.

      [[the hierarchy of analytics]]

    1. aberrant behavior

      Is there data on hand that shows these companies actually prevent cheating? How many instances of 'aberrant behavior' actually materialize into cheating offenses?

    2. keystroke biometrics, ID capture, and facial analysis

      I feel like I'm seeing various responses to what data is actually captured. To me, it doesn't seem like they are consistent in their responses about the types of data collected.

    1. Maybe your dbt models depend on source data tables that are populated by Stitch ingest, or by heavy transform jobs running in Spark. Maybe the tables your models build are depended on by analysts building reports in Mode, or ML engineers running experiments using Jupyter notebooks. Whether you’re a full-stack practitioner or a specialized platform team, you’ve probably felt the pain of trying to track dependencies across technologies and concerns. You need an orchestrator. Dagster lets you embed dbt into a wider orchestration graph.

      It is common for [[data models]] to rely on other sources. This is where something like [[Dagster]] fits in: it lets your dbt project fit into a wider [[orchestration graph]].

    2. We love dbt because of the values it embodies. Individual transformations are SQL SELECT statements, without side effects. Transformations are explicitly connected into a graph. And support for testing is first-class. dbt is hugely enabling for an important class of users, adapting software engineering principles to a slightly different domain with great ergonomics. For users who already speak SQL, dbt’s tooling is unparalleled.

      when using [[dbt]] the [[transformations]] are [[SQL statements]] - already something that our team knows

    3. What is dbt? dbt was created by Fishtown Analytics to enable data analysts to build well-defined data transformations in an intuitive, testable, and versioned environment. Users build transformations (called models) defined in templated SQL. Models defined in dbt can refer to other models, forming a dependency graph between the transformations (and the tables or views they produce). Models are self-documenting, easy to test, and easy to run. And the dbt tooling can use the graph defined by models’ dependencies to determine the ancestors and descendants of any individual model, so it’s easy to know what to recompute when something changes.

      one of the [[benefits of [[dbt]]]] is that the [[data transformations]] or [[data models]] can refer to other models, which helps show the [[dependency graph]] between transformations
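
To make the ancestors/descendants idea in the quoted passage concrete, here is a minimal Python sketch of walking a model dependency graph. It is not dbt's implementation or API: the model names and the `PARENTS` mapping are hypothetical, and the traversal is a plain depth-first search.

```python
from collections import defaultdict

# Hypothetical models: each maps to the models it refers to (its parents).
PARENTS = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "customer_ltv": ["orders_enriched"],
}


def build_children(parents):
    """Invert the parent map so each model lists the models built on top of it."""
    children = defaultdict(list)
    for model, deps in parents.items():
        for dep in deps:
            children[dep].append(model)
    return children


def reachable(start, edges):
    """Return every node reachable from `start` by following `edges` (start excluded)."""
    seen = set()
    stack = list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def ancestors(model, parents=PARENTS):
    """Models that `model` depends on, directly or indirectly."""
    return reachable(model, parents)


def descendants(model, parents=PARENTS):
    """Models that would need recomputing if `model` changed."""
    return reachable(model, build_children(parents))
```

In this toy graph, changing `stg_orders` would mean recomputing `descendants("stg_orders")`, which is the "what to recompute when something changes" question the passage describes.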

    1. The attribution data model: In reality, it’s impossible to know exactly why someone converted to being a customer. The best thing that we can do as analysts, is provide a pretty good guess. In order to do that, we’re going to use an approach called positional attribution. This means, essentially, that we’re going to weight the importance of various touches (customer interactions with a brand) based on their position (the order they occur in within the customer’s lifetime). To do this, we’re going to build a table that represents every “touch” that someone had before becoming a customer, and the channel that led to that touch.

      One of the goals of an [[attribution data model]] is to understand why someone [[converted]] to being a customer. This is impossible to do accurately, but this is where analysis comes in.

      There are several [[approaches to attribution]]; one of them is [[positional attribution]].

      [[positional attribution]] means weighting the importance of touch points (customer interactions) based on their position within the customer lifetime.
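
As a rough illustration of weighting touches by position, here is a small Python sketch. The quoted passage does not give any weights, so the 40% first / 40% last / 20% spread across the middle split is an assumption, as are the function names and the channel names in the usage note.

```python
def positional_weights(n_touches, first=0.4, last=0.4):
    """Return one weight per touch, ordered by position in the customer lifetime.

    Assumed scheme (not from the quoted source): the first and last touches
    get fixed shares and the remainder is split evenly across middle touches.
    """
    if n_touches <= 0:
        return []
    if n_touches == 1:
        return [1.0]
    if n_touches == 2:
        return [0.5, 0.5]
    middle = (1.0 - first - last) / (n_touches - 2)
    return [first] + [middle] * (n_touches - 2) + [last]


def attribute(touch_channels, value=1.0):
    """Split `value` across channels; `touch_channels` is ordered oldest first."""
    credit = {}
    weights = positional_weights(len(touch_channels))
    for channel, weight in zip(touch_channels, weights):
        credit[channel] = credit.get(channel, 0.0) + weight * value
    return credit
```

For example, `attribute(["paid_search", "email", "email", "direct"])` gives the first and last channels 0.4 each and splits the remaining 0.2 between the two middle email touches. Changing the weights changes the answer, which is the sense in which attribution is a modeling decision rather than a fact read off the data.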

    2. Marketers have been told that attribution is a data problem -- “Just get the data and you can have full knowledge of what’s working!” -- when really it’s a data modeling problem. The logic of your attribution model, what the data represents about your business, is as important as the data volume. And the logic is going to change based on your business. That’s why so many attribution products come up short.

      [[attribution isn't a data problem, it's a data modeling problem]] - it's not just the data, but what the data represents about your business.

    1. I increasingly don’t care for the world of centralized software. Software interacts with my data, on my computers. Its about time my software reflected that relationship. I want my laptop and my phone to share my files over my wifi. Not by uploading all my data to servers in another country. Especially if those servers are financed by advertisers bidding for my eyeballs.
  2. Oct 2020
    1. This is until you realize you're probably using at least ten different services, and they all have different purposes, with various kinds of data, endpoints and restrictions. Even if you have the capacity and are willing to do it, it's still damn hard.
    2. Hopefully we can agree that the current situation isn't so great. But I am a software engineer. And chances that if you're reading it, you're very likely a programmer as well. Surely we can deal with that and implement, right? Kind of, but it's really hard to retrieve data created by you.
    1. (d) All calculations shown in this appendix shall be implemented on a site-level basis. Site level concentration data shall be processed as follows: (1) The default dataset for PM2.5 mass concentrations for a site shall consist of the measured concentrations recorded from the designated primary monitor(s). All daily values produced by the primary monitor are considered part of the site record; this includes all creditable samples and all extra samples. (2) Data for the primary monitors shall be augmented as much as possible with data from collocated monitors. If a valid daily value is not produced by the primary monitor for a particular day (scheduled or otherwise), but a value is available from a collocated monitor, then that collocated value shall be considered part of the combined site data record. If more than one collocated daily value is available, the average of those valid collocated values shall be used as the daily value. The data record resulting from this procedure is referred to as the “combined site data record.”
      1. Calculate mean of all collocated NON-primary monitors' values per day
      2. Coalesce primary monitor value with this calculated mean
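
The two steps in the note above can be sketched in Python. This is a minimal sketch under stated assumptions, not a regulatory implementation: missing or invalid daily values are represented as `None`, the function names are hypothetical, and the rules for deciding whether a value is valid are out of scope.

```python
from statistics import mean


def combined_daily_value(primary, collocated):
    """Coalesce the primary value with the mean of valid collocated values.

    `primary` is a float or None; `collocated` is a list of floats or None.
    """
    if primary is not None:
        return primary
    valid = [value for value in collocated if value is not None]
    return mean(valid) if valid else None


def combined_site_record(primary_by_day, collocated_by_day):
    """Build a per-day record from two dicts keyed by day."""
    days = sorted(set(primary_by_day) | set(collocated_by_day))
    return {
        day: combined_daily_value(
            primary_by_day.get(day), collocated_by_day.get(day, [])
        )
        for day in days
    }
```

A day with a primary value keeps it even when collocated values exist; a day with no primary value takes the collocated mean; a day with neither stays `None`.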
    1. Institutions that were primarily online before the pandemic are also doing well. At colleges where more than 90 percent of students took courses solely online pre-pandemic, enrollments are growing for both undergraduate (6.8 percent) and graduate students (7.2 percent).
    1. If you define a variable outside of your form, you can then set the value of that variable to the handleSubmit function that 🏁 React Final Form gives you, and then you can call that function from outside of the form.
    1. We could freeze the objects in the model but don't for efficiency. (The benefits of an immutable-equivalent data structure will be documented in vtree or blog post at some point)

      first sighting: "immutable-equivalent data"

    2. A VTree is designed to be equivalent to an immutable data structure. While it's not actually immutable, you can reuse the nodes in multiple places and the functions we have exposed that take VTrees as arguments never mutate the trees.
    1. We don't know if the passed in props is a user created object that can be mutated so we must always clone it once.
    1. Legislation to stem the tide of Big Tech companies' abuses, and laws—such as a national consumer privacy bill, an interoperability bill, or a bill making firms liable for data-breaches—would go a long way toward improving the lives of the Internet users held hostage inside the companies' walled gardens. But far more important than fixing Big Tech is fixing the Internet: restoring the kind of dynamism that made tech firms responsive to their users for fear of losing them, restoring the dynamic that let tinkerers, co-ops, and nonprofits give every person the power of technological self-determination.
    1. (Roose, who has since deleted his tweet as part of a routine purge of tweets older than 30 days, told me it was intended simply as an observation, not a full analysis of the trends.)

      Another example of someone deleting their tweets at regular intervals. I've seen a few examples of this in academia.

    1. More conspicuously, since Trump’s election, the RNC — at his campaign’s direction — has excluded critical “voter scores” on the president from the analytics it routinely provides to GOP candidates and committees nationwide, with the aim of electing down-ballot Republicans. Republican consultants say the Trump information is being withheld for two reasons: to discourage candidates from distancing themselves from the president, and to avoid embarrassing him with poor results that might leak. But they say its concealment harms other Republicans, forcing them to campaign without it or pay to get the information elsewhere.
    1. A statistician is the exact same thing as a data scientist or machine learning researcher with the differences that there are qualifications needed to be a statistician, and that we are snarkier.
    1. you can then use “Sign In with Google” to access the publisher’s products, but Google does the billing, keeps your payment method secure, and makes it easy for you to manage your subscriptions all in one place.  

      I immediately wonder who owns my related subscription data? Is the publisher only seeing me as a lumped Google proxy or do they get my name, email address, credit card information, and other details?

      How will publishers be able (or not) to contact me? What effect will this have on potential customer retention?

    1. Methodology To determine the link between heat and income in U.S. cities, NPR used NASA satellite imagery and U.S. Census American Community Survey data. An open-source computer program developed by NPR downloaded median household income data for census tracts in the 100 most populated American cities, as well as geographic boundaries for census tracts. NPR combined these data with TIGER/Line shapefiles of the cities.

      This is an excellent example of data journalism.

    1. Note that interacting with these <input> elements will mutate the array. If you prefer to work with immutable data, you should avoid these bindings and use event handlers instead.
    1. use-methods is built on immer, which allows you to write your methods in an imperative, mutating style, even though the actual state managed behind the scenes is immutable.
    1. 1.1. Monitors For the purposes of AQS, a monitor does not refer to a specific piece of equipment. Instead, it reflects that a given pollutant (or other parameter) is being measured at a given site. Identified by: The site (state + county + site number) where the monitor is located AND The pollutant code AND POC – Parameter Occurrence Code. Used to uniquely identify a monitor if there is more than one device measuring the same pollutant at the same site. For example monitor IDs are usually written in the following way: SS-CCC-NNNN-PPPPP-Q where SS is the State FIPS code, CCC is the County FIPS code, and NNNN is the Site Number within the county (leading zeroes are always included for these fields), PPPPP is the AQS 5-digit parameter code, and Q is the POC. For example: 01-089-0014-44201-2 is Alabama, Madison County, Site Number 14, ozone monitor, POC 2.

      How monitors (specific measures of specific criteria) are identified in AQS data.
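
The `SS-CCC-NNNN-PPPPP-Q` layout described above can be split into its fields with a short Python sketch. The class and function names are hypothetical, and allowing a POC of more than one digit is an assumption; the quoted text only shows a single-digit example.

```python
import re
from typing import NamedTuple


class MonitorId(NamedTuple):
    state_fips: str      # SS, leading zeroes kept
    county_fips: str     # CCC, leading zeroes kept
    site_number: str     # NNNN, leading zeroes kept
    parameter_code: str  # PPPPP, AQS 5-digit parameter code
    poc: int             # Q, Parameter Occurrence Code

    @property
    def site_key(self):
        """State + county + site number, which together identify the site."""
        return f"{self.state_fips}-{self.county_fips}-{self.site_number}"


# Assumption: the POC may have one or more digits.
_MONITOR_ID = re.compile(r"^(\d{2})-(\d{3})-(\d{4})-(\d{5})-(\d+)$")


def parse_monitor_id(text):
    """Parse an ID such as '01-089-0014-44201-2' into its components."""
    match = _MONITOR_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a monitor ID in SS-CCC-NNNN-PPPPP-Q form: {text!r}")
    state, county, site, parameter, poc = match.groups()
    return MonitorId(state, county, site, parameter, int(poc))
```

The FIPS and site fields stay as strings so that leading zeroes survive, which matters when joining against other tables keyed on those codes.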

  3. Sep 2020
    1. "The Data Visualisation Catalogue is a project developed by Severino Ribecca to create a library of different information visualisation types." I like the explanations of when one might use a particular type of data visualization to highlight - or obscure! - what the data is saying.

    1. The RDF model encodes data in the form of subject, predicate, object triples. The subject and object of a triple are both URIs that each identify a resource, or a URI and a string literal respectively. The predicate specifies how the subject and object are related, and is also represented by a URI.

      Basic description of Resource Description Framework
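
A small Python sketch may help picture the triple structure described in the quoted passage. It uses plain tuples rather than any RDF library, and the `http://example.org/` URIs, the `Triple` class, and the `match` helper are all hypothetical illustrations.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str                  # URI identifying a resource
    predicate: str                # URI naming the relationship
    obj: str                      # URI, or a string literal
    obj_is_literal: bool = False  # True when obj is a literal, not a URI


EX = "http://example.org/"  # hypothetical namespace

TRIPLES = [
    Triple(EX + "alice", EX + "knows", EX + "bob"),
    Triple(EX + "alice", EX + "name", "Alice", obj_is_literal=True),
    Triple(EX + "bob", EX + "name", "Bob", obj_is_literal=True),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]
```

Calling `match(TRIPLES, subject=EX + "alice")` returns both statements about that resource: one whose object is another URI and one whose object is a string literal.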

    1. Facebook ignored or was slow to act on evidence that fake accounts on its platform have been undermining elections and political affairs around the world, according to an explosive memo sent by a recently fired Facebook employee and obtained by BuzzFeed News. The 6,600-word memo, written by former Facebook data scientist Sophie Zhang, is filled with concrete examples of heads of government and political parties in Azerbaijan and Honduras using fake accounts or misrepresenting themselves to sway public opinion. In countries including India, Ukraine, Spain, Brazil, Bolivia, and Ecuador, she found evidence of coordinated campaigns of varying sizes to boost or hinder political candidates or outcomes, though she did not always conclude who was behind them.
    1. Provides state management for tree-like components. Handles building a collection of items from props, item expanded state, and manages multiple selection state.
    1. Nic Fildes in London and Javier Espinoza in Brussels, April 8 2020. When the World Health Organization launched a 2007 initiative to eliminate malaria on Zanzibar, it turned to an unusual source to track the spread of the disease between the island and mainland Africa: mobile phones sold by Tanzania’s telecoms groups including Vodafone, the UK mobile operator. Working together with researchers at Southampton university, Vodafone began compiling sets of location data from mobile phones in the areas where cases of the disease had been recorded. Mapping how populations move between locations has proved invaluable in tracking and responding to epidemics. The Zanzibar project has been replicated by academics across the continent to monitor other deadly diseases, including Ebola in west Africa. “Diseases don’t respect national borders,” says Andy Tatem, an epidemiologist at Southampton who has worked with Vodafone in Africa. “Understanding how diseases and pathogens flow through populations using mobile phone data is vital.”
      the best way to track the spread of the pandemic is to use heatmaps built on data of multiple phones which, if overlaid with medical data, can predict how the virus will spread and determine whether government measures are working.
      
    1. Svelte offers an immutable way — but it’s just a mask to hide “assignment”, because assignment triggers an update, but not immutability. So it’s enough to write todos=todos, after that Svelte triggers an update.
    1. How to Export Your Content If you log into Graphite before August 15th, you can download each file in any of the available formats offered. If you'd like a bulk download, I recommend (for the technically inclined) using the exporter tool I created. For those less technically inclined, Blockstack may have some options for you. Remember, Graphite never owned your content. Never had control of your content. And that was the real power of its offering. 
    1. Finding data

      You're right about data here. I follow some research out of the MIT Media lab by Cesar Hidalgo who may have some interesting data resources if you poke around.

      Some additional starting points:

    1. In this app, we have a <Hoverable> component that tracks whether the mouse is currently over it. It needs to pass that data back to the parent component, so that we can update the slotted contents. For this, we use slot props.
    1. We wanted a library designed specifically for a functional programming style, one that makes it easy to create functional pipelines, one that never mutates user data.
    1. Had it not been for the attentiveness of one person who went beyond the task of classifying galaxies into predetermined categories and was able to communicate this to the researchers via the online forum, what turned out to be important new phenomena might have gone undiscovered.

      Sometimes our attempts to improve data quality in citizen science projects can actually work against us. Pre-determined categories and strict regulations could prevent the reporting of important outliers.

    1. This is probably one of the biggest things to get used to in React – this flow where data goes out and then back in.
  4. Aug 2020
    1. Lozano, R., Fullman, N., Mumford, J. E., Knight, M., Barthelemy, C. M., Abbafati, C., Abbastabar, H., Abd-Allah, F., Abdollahi, M., Abedi, A., Abolhassani, H., Abosetugn, A. E., Abreu, L. G., Abrigo, M. R. M., Haimed, A. K. A., Abushouk, A. I., Adabi, M., Adebayo, O. M., Adekanmbi, V., … Murray, C. J. L. (2020). Measuring universal health coverage based on an index of effective coverage of health services in 204 countries and territories, 1990–2019: A systematic analysis for the Global Burden of Disease Study 2019. The Lancet, 0(0). https://doi.org/10.1016/S0140-6736(20)30750-9

    1. Ray, E. L., Wattanachit, N., Niemi, J., Kanji, A. H., House, K., Cramer, E. Y., Bracher, J., Zheng, A., Yamana, T. K., Xiong, X., Woody, S., Wang, Y., Wang, L., Walraven, R. L., Tomar, V., Sherratt, K., Sheldon, D., Reiner, R. C., Prakash, B. A., … Consortium, C.-19 F. H. (2020). Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the U.S. MedRxiv, 2020.08.19.20177493. https://doi.org/10.1101/2020.08.19.20177493

    1. Nguyen, L. H., Drew, D. A., Graham, M. S., Joshi, A. D., Guo, C.-G., Ma, W., Mehta, R. S., Warner, E. T., Sikavi, D. R., Lo, C.-H., Kwon, S., Song, M., Mucci, L. A., Stampfer, M. J., Willett, W. C., Eliassen, A. H., Hart, J. E., Chavarro, J. E., Rich-Edwards, J. W., … Zhang, F. (2020). Risk of COVID-19 among front-line health-care workers and the general community: A prospective cohort study. The Lancet Public Health, 0(0). https://doi.org/10.1016/S2468-2667(20)30164-X

    1. Menni, C., Valdes, A. M., Freidin, M. B., Sudre, C. H., Nguyen, L. H., Drew, D. A., ... & Visconti, A. (2020). Real-time tracking of self-reported symptoms to predict potential COVID-19. Nature Medicine, 1-4.

    1. Cluster 0 words: family, home, mother, war, house, dies
       Cluster 0 titles: Schindler's List, One Flew Over the Cuckoo's Nest, Gone with the Wind, The Wizard of Oz, Titanic, Forrest Gump, E.T. the Extra-Terrestrial, The Silence of the Lambs, Gandhi, A Streetcar Named Desire, The Best Years of Our Lives, My Fair Lady, Ben-Hur, Doctor Zhivago, The Pianist, The Exorcist, Out of Africa, Good Will Hunting, Terms of Endearment, Giant, The Grapes of Wrath, Close Encounters of the Third Kind, The Graduate, Stagecoach, Wuthering Heights
       Cluster 1 words: police, car, killed, murders, driving, house
       Cluster 1 titles: Casablanca, Psycho, Sunset Blvd., Vertigo, Chinatown, Amadeus, High Noon, The French Connection, Fargo, Pulp Fiction, The Maltese Falcon, A Clockwork Orange, Double Indemnity, Rebel Without a Cause, The Third Man, North by Northwest
       Cluster 2 words: father, new, york, new, brothers, apartments
       Cluster 2 titles: The Godfather, Raging Bull, Citizen Kane, The Godfather: Part II, On the Waterfront, 12 Angry Men, Rocky, To Kill a Mockingbird, Braveheart, The Good, the Bad and the Ugly, The Apartment, Goodfellas, City Lights, It Happened One Night, Midnight Cowboy, Mr. Smith Goes to Washington, Rain Man, Annie Hall, Network, Taxi Driver, Rear Window
       Cluster 3 words: george, dance, singing, john, love, perform
       Cluster 3 titles: West Side Story, Singin' in the Rain, It's a Wonderful Life, Some Like It Hot, The Philadelphia Story, An American in Paris, The King's Speech, A Place in the Sun, Tootsie, Nashville, American Graffiti, Yankee Doodle Dandy
       Cluster 4 words: killed, soldiers, captain, men, army, command
       Cluster 4 titles: The Shawshank Redemption, Lawrence of Arabia, The Sound of Music, Star Wars, 2001: A Space Odyssey, The Bridge on the River Kwai, Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, Apocalypse Now, The Lord of the Rings: The Return of the King, Gladiator, From Here to Eternity, Saving Private Ryan, Unforgiven, Raiders of the Lost Ark, Patton, Jaws, Butch Cassidy and the Sundance Kid, The Treasure of the Sierra Madre, Platoon, Dances with Wolves, The Deer Hunter, All Quiet on the Western Front, Shane, The Green Mile, The African Queen, Mutiny on the Bounty

      The top IMDB films fit into 2 basic clusters, and 4 main clusters (this project used K-means with a target of 5 but actually clusters 1 and 2 both fit the crime category, and all except Cluster 3 are centred around violence).

      1. War whilst with Family and at home (violence external whilst passively defending the safety of the self/family)
      2. Crime (violence on a smaller scale)
      3. New York crime family / mafia (violence in the family)
      4. Musicals (non-violence)
      5. War as soldiers (violence on a large scale on the front lines)

      If this list is representative of the human psyche we have only 2 basic modes of being: Violence / Musical.

    1. his dream of it being as easy to “insert facts, data, and models in political discussion as it is to insert emoji” 😉 speaks to a sort of consumerist, on-demand thirst for snippets, rather than a deep understanding of complexity. It’s app-informed, drag-and-drop data for instant government.