116 Matching Annotations
  1. Dec 2023
    1. “MLX” is more than just a technical solution; it is an innovative and user-friendly framework inspired by popular frameworks like PyTorch, Jax, and ArrayFire. It facilitates the training and deployment of AI models on Apple devices without sacrificing performance or compatibility.

      MLX (high-level overview)

  2. Apr 2023
    1. This is the space where AI can thrive, tirelessly processing these countless features of every patient I’ve ever treated, and every other patient treated by every other physician, giving us deep, vast insights. AI can help do this eventually, but it will first need to ingest millions of patient data sets that include those many features, the things the patients did (like take a specific medication), and the outcome.

      AI tools yes, not ChatGPT though. More contextualising and specialisation needed. And I'd add the notion that AI might be necessary as a temporary fix, on our way to statistics. Its power is in weighing (literally) many more different factors than we could statistically figure out, also because of interdependencies between factors. Once that's done, there may well be a path to less black-box tooling, from ML/DL towards logistic regression: https://pubmed.ncbi.nlm.nih.gov/33208887/ [[Machine learning niet beter dan Regressie 20201209145001]]

  3. Oct 2022
    1. As of today, a lot of the things in ML are not automated. They are manual or semi-manual.
    2. Not everything can be tested/evaluated with a metric like AUC or R2. Sometimes, people just have to check if things improved and not just metrics got better.
    3. I believe that packaging/building/deploying the vanilla, run-of-the-mill ML model will become common knowledge for backend devs.
  4. Aug 2022
  5. Sep 2021
    1. Machine Unlearning is a new area of research that aims to make models selectively “forget” specific data points it was trained on, along with the “learning” derived from it i.e. the influence of these training instances on model parameters. The goal is to remove all traces of a particular data point from a machine learning system, without affecting the aggregate model performance. Why is it important? Work is motivated in this area, in part by growing concerns around privacy and regulations like GDPR and the “Right to be Forgotten”. While multiple companies today allow users to request their private data be deleted, there is no way to request that all context learned by algorithms from this data be deleted as well. Furthermore, as we have covered previously, ML models suffer from information leakage and machine unlearning can be an important lever to combat this. In our view, there is another important reason why machine unlearning is important: it can help make recurring model training more efficient by making models forget those training examples that are outdated, or no longer matter.

      What is Machine UNlearning and why is it important

  6. Apr 2021
    1. Machine learning is an extension of linear regression in a few ways. Firstly is that modern ML

      Machine learning is an extension of the linear model which deals with much more complicated situations, where we take a few different inputs and get outputs.

    2. Machine learning is using data we have (known as training data) to learn a function that we can use to make predictions about data we don’t have yet.

      Machine learning extends linear regression and is used for making predictions
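
      A minimal sketch of that idea in scikit-learn: fit a function to the training data we have, then predict on an input we have not seen yet (the toy numbers are illustrative, not from the source):

      import numpy as np
      from sklearn.linear_model import LinearRegression

      # Training data we already have: inputs X and observed outputs y (toy values)
      X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
      y_train = np.array([2.1, 3.9, 6.2, 8.1])

      # "Learn a function" from the training data
      model = LinearRegression().fit(X_train, y_train)

      # Use it to make predictions about data we don't have yet
      print(model.predict(np.array([[5.0]])))  # roughly 10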

  7. Jan 2021
    1. A big reason for GPUs popularity in Machine Learning is that they are essentially fast matrix multiplication machines. Deep Learning is essentially matrix multiplication.

      Deep Learning is mostly about matrix multiplication

    2. Unity ML agents is a way for you to turn a video game into a Reinforcement Learning environment.

      Unity ML agents is a great way to practice RL

    3. Haskell is the best functional programming language in the world and Neural Networks are functions. This is the main motivation behind Hasktorch which lets you discover new kinds of Neural Network architectures by combining functional operators

      Haskell is a great solution for neural networks

    4. I can’t think of a single large company where the NLP team hasn’t experimented with HuggingFace. They add new Transformer models within days of the papers being published, they maintain tokenizers, datasets, data loaders, NLP apps. HuggingFace has created multiple layers of platforms that each could be a compelling company in its own right.

      HuggingFace company

    5. Keras is a user-centric library whereas Tensorflow, especially Tensorflow 1.0, is a machine-centric library. ML researchers think in terms of layers, automatic differentiation engines think in terms of computational graphs. As far as I’m concerned my time is more valuable than the cycles of a machine so I’d rather use something like Keras.

      Why simplicity of Keras is important

    6. Graduate Student Descent

      :)

    7. BERT engineer is now a full time job. Qualifications include:

      • Some bash scripting
      • Deep knowledge of pip (starting a new environment is the suckier version of practicing scales)
      • Waiting for new HuggingFace models to be released
      • Watching Yannic Kilcher’s new Transformer paper the day it comes out
      • Repeating what Yannic said at your team reading group

      Structure of a BERT engineer job

    8. “Useful” Machine Learning research on all datasets has essentially reduced to making Transformers faster, smaller and scale to longer sequence lengths.

      Typical type of advancement we see in ML

    9. I often get asked by young students new to Machine Learning, what math do I need to know for Deep Learning and my answer is Matrix Multiplication and Derivatives of square functions.

      Deep Neural Networks are a composition of matrix multiplications with the occasional non-linearity in between
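
      A toy numpy illustration of that claim (random weights, shapes chosen arbitrarily): the forward pass is just matrix multiplications with an occasional non-linearity in between.

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.normal(size=(1, 4))            # one input example with 4 features

      W1 = rng.normal(size=(4, 8))           # weights of the first layer
      W2 = rng.normal(size=(8, 2))           # weights of the second layer

      h = np.maximum(x @ W1, 0)              # matrix multiplication + ReLU non-linearity
      y = h @ W2                             # another matrix multiplication
      print(y.shape)                         # (1, 2)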

  8. Dec 2020
    1. TFX and Tensorflow run anywhere Python runs, and that’s a lot of places

      You can run your Tensorflow models on:

    2. since TFX and Tensorflow were built by Google, it has first-class support in the Google Cloud Platform.

      TFX and Tensorflow work well with GCP

    3. After consideration, you decide to use Python as your programming language, Tensorflow for model building because you will be working with a large dataset that includes images, and Tensorflow Extended (TFX), an open-source tool released and used internally at Google, for building your pipelines.

      Sample tech stack of a ML project:

      • Python <--- programming
      • Tensorflow <--- model building (large with images)
      • TFX (Tensorflow Extended) <--- pipeline builder (model analysis, monitoring, serving, ...)
    4. These components have built-in support for ML modeling, training, serving, and even managing deployments to different targets.

      Components of TFX:

    5. Most data scientists feel that model deployment is a software engineering task and should be handled by software engineers because the required skills are more closely aligned with their day-to-day work. While this is somewhat true, data scientists who learn these skills will have an advantage, especially in lean organizations. Tools like TFX, Mlflow, Kubeflow can simplify the whole process of model deployment, and data scientists can (and should) quickly learn and use them. 

      As a Data Scientist, you should consider practicing TFX, MLflow, or Kubeflow

    6. TFX Component called TensorFlow Model Analysis (TFMA) allows you to easily evaluate new models against current ones before deployment. 

      TFMA component of TFX seems to be its core functionality

    7. In general, for smaller businesses like startups, it is usually cheaper and better to use managed cloud services for your projects. 

      Advice for startups working with ML in production

    8. This has its pros and cons and may depend on your use case as well. Some of the pros to consider when considering using managed cloud services are:

      Pros of using cloud services:

      • They are cost-efficient
      • Quick setup and deployment
      • Efficient backup and recovery

      Cons of using cloud services:

      • Security issue, especially for sensitive data
      • Internet connectivity may affect work since everything runs online
      • Recurring costs
      • Limited control over tools
    9. The data is already in the cloud, so it may be better to build your ML system in the cloud. You’ll get better latency for I/O, easy scaling as data becomes larger (hundreds of gigabytes), and quick setup and configuration for any additional GPUs and TPUs. 

      If the data for your project is already in the cloud, try to stick to cloud solutions

    10. ML projects are never static. This is part of engineering and design that must be considered from the start. Here you should answer questions like:

      We need to prepare for these 2 things in ML projects:

      • How do we get feedback from a model in production?
      • How do you set up CD?
    11. The choice of framework is very important, as it can decide the continuity, maintenance, and use of a model. In this step, you must answer the following questions:

      Questions to ask before choosing a particular tool/framework:

      • What is the best tool for the task at hand?
      • Are the choice of tools open-source or closed?
      • How many platforms/targets support the tool?

      Try to compare the tools based on:

      • Efficiency
      • Popularity
      • Support
    12. These questions are important as they will guide you on what frameworks or tools to use, how to approach your problem, and how to design your ML model.

      Critical questions for ML projects:

      • How is your training data stored?
      • How large is your data?
      • How will you retrieve the data for training?
      • How will you retrieve data for prediction?
    13. you should not invest in an ML project if you have no plan to put it in production, except of course when doing pure research.

      Tip #1 when applying ML in business

    14. The difficulties in model deployment and management have given rise to a new, specialized role: the machine learning engineer. Machine learning engineers are closer to software engineers than typical data scientists, and as such, they are the ideal candidate to put models into production.

      Why Machine Learning Engineer role exists

    15. The goal of building a machine learning model is to solve a problem, and a machine learning model can only do so when it is in production and actively in use by consumers. As such, model deployment is as important as model building.

      Model deployment is as important as model building

    16. Venturebeat reports that 87% of data science projects never make it to production, while redapt claims it is 90%.
  9. Oct 2020
    1. Statistical techniques: average, quantiles, probability distribution, association rules
      Supervised ML algorithms: logistic regression, neural net, time-series analysis
      Unsupervised ML algorithms: Cluster analysis, Bayesian network, Peer group analysis, break point analysis, Benford’s law (law of anomalous numbers)

      Typical techniques used in financial fraud classification

    2. In machine learning parlance, fraud detection is generally treated as a supervised classification problem, where observations are classified as “fraud” or “non-fraud” based on the features in those observations. It is also an interesting problem in ML research due to imbalanced data — i.e. there are very few cases of fraud in an extremely large number of transactions.

      Financial fraud is generally solved as a supervised classification, but we've got the problem of imbalanced data
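
      A minimal sketch of that setup with scikit-learn, using synthetic data to mimic the imbalance (roughly 1% "fraud"); class_weight='balanced' is one simple way to compensate for it:

      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import classification_report

      # Synthetic, heavily imbalanced data: ~99% "non-fraud", ~1% "fraud"
      X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

      # Weight the rare class more heavily so the classifier doesn't ignore it
      clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
      print(classification_report(y_test, clf.predict(X_test)))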

    3. With ever-increasing online transactions and production of a large volume of customer data, machine learning has been increasingly seen as an effective tool to detect and counter frauds. However, there is no specific tool, the silver bullet, that works for all kinds of fraud detection problems in every single industry. The nature of the problem is different in every case and every industry. Therefore every solution is carefully tailored within the domain of each industry.

      Machine learning in fraud detection

    1. Three rules of thumb for choosing the number of hidden neurons:
      1. The number of hidden neurons should be between the size of the input layer and the size of the output layer.
      2. The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
      3. The number of hidden neurons should be less than twice the size of the input layer.

      3 rules of thumb while choosing the number of hidden layers and neurons
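
      A small sketch applying those three rules of thumb to a hypothetical network with 10 inputs and 2 outputs (the sizes are made up for illustration):

      n_in, n_out = 10, 2

      rule_1 = (n_out, n_in)                 # between the output-layer and input-layer size
      rule_2 = round(2 / 3 * n_in + n_out)   # 2/3 of the input layer plus the output layer
      rule_3_max = 2 * n_in                  # should stay below twice the input-layer size

      print(rule_1, rule_2, rule_3_max)      # (2, 10) 9 20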

    1. Let's check which type of model gives the best accuracy: sns.boxplot(data=models_df, x='score', y='model')

      After comparing the pipelined ML models, we can easily display a comparison boxplot.

      Well-performing ML models: 1) XGBoost 2) LightGBM 3) CatBoost.

    2. With this data, it looks like there is practically no real difference (we are not fighting over a 0.01 percentage point improvement in model accuracy here). So maybe training time matters? sns.boxplot(data=models_df, x='time_elapsed', y='model')

      Training time of some popular ML models. After considering the performance, it's worth using XGBoost and LightGBM.

    3. Now, in nested loops, we can test every combination by swapping classifiers and transformers (the whole loop takes a while to run):

      Example (below) of when creating pipelines with scikit-learn makes sense. Basically, it's convenient to use it while comparing multiple models in a loop
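
      A minimal sketch of such a nested loop, assuming generic X, y data; it produces a models_df with 'model', 'score' and 'time_elapsed' columns that the sns.boxplot calls above could consume:

      import pandas as pd
      from time import time
      from sklearn.datasets import load_breast_cancer
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler, MinMaxScaler
      from sklearn.linear_model import LogisticRegression
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score

      X, y = load_breast_cancer(return_X_y=True)

      transformers = [StandardScaler(), MinMaxScaler()]
      classifiers = [LogisticRegression(max_iter=1000), RandomForestClassifier()]

      rows = []
      for transformer in transformers:          # every transformer...
          for clf in classifiers:               # ...paired with every classifier
              pipe = Pipeline([("scale", transformer), ("model", clf)])
              start = time()
              score = cross_val_score(pipe, X, y, cv=5).mean()
              rows.append({"model": type(clf).__name__,
                           "score": score,
                           "time_elapsed": time() - start})

      models_df = pd.DataFrame(rows)
      print(models_df)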

  10. Sep 2020
    1. Tribuo is a machine learning library in Java that provides multi-class classification, regression, clustering, anomaly detection and multi-label classification.

      Tribuo - Java ML library from Oracle

    1. but companies will discover that ML is (for the vast majority) a shiny object with no discernible ROI. Like Big Data, I think we'll see a few companies execute well and actually get some value, while most will just jump to the next shiny thing in a year or two.

      As long as ROI isn't clearly visible in ML, it might not bring more ML positions

  11. Aug 2020
    1. Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

      After all, magically you get the right Python snippet (based on scikit-learn)

    2. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming

      TPOT automates the following:

      • feature selection
      • feature preprocessing
      • feature construction
      • model selection
      • parameter optimisation
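
      A minimal sketch of how a TPOT run typically looks (small generations/population so it finishes quickly; the dataset is just an example):

      from tpot import TPOTClassifier
      from sklearn.datasets import load_digits
      from sklearn.model_selection import train_test_split

      X, y = load_digits(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

      # Genetic programming search over preprocessing + model + hyperparameters
      tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
      tpot.fit(X_train, y_train)

      print(tpot.score(X_test, y_test))
      tpot.export("best_pipeline.py")   # writes the scikit-learn code for the best pipeline
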
    1. I learned and did to make it possible to do Automated Speech Recognition research on a Colab instance.

      It's possible to do Automated Speech Recognition with Google Colab.

      Notebook instance of this post

  12. Jul 2020
    1. Libra is a machine learning API designed for non-technical users. This means that it assumes that you have no background in ML whatsoever.

      With Libra you can write your ML code much faster:

      For example, that's how it compares to Keras.

  13. May 2020
    1. Amazon Machine Learning Deprecated. Use SageMaker instead.

      Instead of Amazon Machine Learning use Amazon SageMaker

  14. Apr 2020
    1. Another approach is to use a tool like H2O to export the model as a POJO in a JAR Java library, which you can then add as a dependency in your application. The benefit of this approach is that you can train the models in a language familiar to Data Scientists, such as Python or R, and export the model as a compiled binary that runs in a different target environment (JVM), which can be faster at inference time

      H2O - export models trained in Python/R as a POJO in JAR
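
      A minimal sketch of that flow with the h2o Python package (the CSV path and column layout are hypothetical):

      import h2o
      from h2o.estimators import H2OGradientBoostingEstimator

      h2o.init()
      frame = h2o.import_file("training_data.csv")        # hypothetical dataset

      model = H2OGradientBoostingEstimator()
      model.train(x=frame.columns[:-1], y=frame.columns[-1], training_frame=frame)

      # Export the trained model as a POJO (.java) that can be compiled into a JAR
      h2o.download_pojo(model, path=".")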

    2. Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles.

      Continuous Delivery for Machine Learning (CD4ML) (long definition)

      Basic principles:

      • software engineering approach
      • cross-functional team
      • producing software based on code, data, and ml models
      • small and safe increments
      • reproducible and reliable software release
      • short adaptation cycles
    3. In order to formalise the model training process in code, we used an open source tool called DVC (Data Science Version Control). It provides similar semantics to Git, but also solves a few ML-specific problems:

      DVC - transform model training process into code.

      Advantages:

      • it has multiple backend plugins to fetch and store large files on an external storage outside of the source control repository;
      • it can keep track of those files' versions, allowing us to retrain our models when the data changes;
      • it keeps track of the dependency graph and commands used to execute the ML pipeline, allowing the process to be reproduced in other environments;
      • it can integrate with Git branches to allow multiple experiments to co-exist
    4. Machine Learning pipeline for our Sales Forecasting problem, and the 3 steps to automate it with DVC

      Sales Forecasting process

    5. Continuous Delivery for Machine Learning end-to-end process

      end-to-end process

    6. common functional silos in large organizations can create barriers, stifling the ability to automate the end-to-end process of deploying ML applications to production

      Common ML process (leading to delays and frictions)

    7. There are different types of testing that can be introduced in the ML workflow.

      Automated tests for ML system:

      • validating data
      • validating component integration
      • validating the model quality
      • validating model bias and fairness
    8. example of how to combine different test pyramids for data, model, and code in CD4ML

      Combining tests for data (purple), model (green) and code (blue) testing

    9. A deployment pipeline automates the process for getting software from version control into production, including all the stages, approvals, testing, and deployment to different environments

      Deployment pipeline

    10. We chose to use GoCD as our Continuous Delivery tool, as it was built with the concept of pipelines as a first-class concern

      GoCD - open source Continuous Delivery tool

    11. Continuous Delivery for Machine Learning (CD4ML) is the discipline of bringing Continuous Delivery principles and practices to Machine Learning applications.

      Continuous Delivery for Machine Learning (CD4ML)

    1. Common questions are: How many users click this button; what % of users that visit a screen click this button; how many users have signed up by region or account type? However, the data needed to answer those questions may not exist! If the data does exist, it’s likely “dirty” - undocumented, tough to find or could be factually inaccurate. It’ll be tough to work with! You could spend hours or days attempting to answer a single question only to discover that you can’t sufficiently answer it for a stakeholder. In machine learning, you may be asked to optimize some process or experience for consumers. However, there’s uncertainty with how much, if at all, the experience can be improved!

      Common types of problems you might be working with in Data Science / Machine Learning industry

    1. In data science community the performance of the model on the test dataset is one of the most important things people look at. Just look at the competitions on kaggle.com. They are extremely focused on test dataset and the performance of these models is really good.

      In data science, the performance of the model on the test dataset is the most important metric for the majority.

      It's not always the best measurement, since the best-scoring model can completely misperform when it receives a different type of dataset

    2. It's basically a look up table, interpolating between known data points. Except, unlike other interpolants like 'linear', 'nearest neighbour' or 'cubic', the underlying functional form is determined to best represent the kind of data you have.

      You can describe AI/ML methods as a lookup table that adjusts to your data points, unlike other interpolants (linear, nearest neighbour or cubic)
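
      A tiny contrast of the two ideas on toy points: a classic piecewise-linear interpolant versus a model whose functional form is fitted to the data (Ridge here is just an arbitrary stand-in):

      import numpy as np
      from sklearn.linear_model import Ridge

      x_known = np.array([0.0, 1.0, 2.0, 3.0])       # known data points (toy values)
      y_known = np.array([0.0, 0.8, 0.9, 0.1])

      # Classic interpolant: a lookup between known points
      y_linear = np.interp(1.5, x_known, y_known)

      # "Learned" interpolant: the functional form is fitted to the data
      model = Ridge(alpha=1e-3).fit(x_known.reshape(-1, 1), y_known)
      y_learned = model.predict([[1.5]])[0]

      print(y_linear, y_learned)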

  15. Mar 2020
    1. Another nice SQL script paired with CRON jobs was the one that reminded people of carts that was left for more than 48 hours. Select from cart where state is not empty and last date is more than or equal to 48hrs.... Set this as a CRON that fires at 2AM everyday, period with less activity and traffic. People wake up to emails reminding them about their abandoned carts. Then sit watch magic happens. No AI/ML needed here. Just good 'ol SQL + Bash.

      Another example of using SQL + a CRON job + Bash to remind customers of a cart that was left behind (again, no ML needed here)
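
      A minimal sketch of that job in Python + SQLite (the shop.db file and the cart table schema are hypothetical); scheduling it with cron, e.g. 0 2 * * *, gives the 2AM run described above:

      import sqlite3

      conn = sqlite3.connect("shop.db")   # hypothetical database
      rows = conn.execute(
          """
          SELECT user_email, cart_id
          FROM cart
          WHERE state != ''
            AND last_date <= datetime('now', '-48 hours')
          """
      ).fetchall()

      for email, cart_id in rows:
          # in practice this would call the mailer; print stands in for send_email()
          print(f"Reminder to {email} about abandoned cart {cart_id}")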

    2. I will write a query like select from order table where last shop date is 3 or greater months. When we get this information, we will send a nice "we miss you, come back and here's X Naira voucher" email. The conversion rate for this one was always greater than 50%.

      Sometimes SQL is much more than enough (you don't need ML)

    1. Here’s a very simple example of how a VQA system might answer the question “what color is the triangle?”
      1. Look for shapes and colours using CNN.
      2. Understand the question type with NLP.
      3. Determine strength for each possible answer.
      4. Convert each answer strength to % probability
    2. Visual Question Answering (VQA): answering open-ended questions about images. VQA is interesting because it requires combining visual and language understanding.

      Visual Question Answering (VQA) = visual + language understanding

    3. Most VQA models would use some kind of Recurrent Neural Network (RNN) to process the question input
      • Most VQA models will use an RNN to process the question input
      • Easier VQA datasets are fine with a BOW (bag of words) vector of the question as input to a standard (feedforward) NN
    4. The standard approach to performing VQA looks something like this: Process the image. Process the question. Combine features from steps 1/2. Assign probabilities to each possible answer.

      Approach to handle VQA problems:

      animation
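
      A minimal Keras sketch of the standard approach above (layer sizes, the answer-set size and the bag-of-words vocabulary are illustrative assumptions, not taken from the article):

      import tensorflow as tf
      from tensorflow.keras import layers

      num_answers = 13        # assumed size of the answer set
      vocab_size = 1000       # assumed question vocabulary (bag-of-words input)

      # 1. Process the image with a small CNN
      image_in = layers.Input(shape=(64, 64, 3))
      x = layers.Conv2D(8, 3, padding="same", activation="relu")(image_in)
      x = layers.MaxPooling2D()(x)
      x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
      x = layers.MaxPooling2D()(x)
      image_feat = layers.Dense(32, activation="tanh")(layers.Flatten()(x))

      # 2. Process the question (here a bag-of-words vector) with a feedforward net
      question_in = layers.Input(shape=(vocab_size,))
      question_feat = layers.Dense(32, activation="tanh")(question_in)

      # 3. Combine features from both branches
      merged = layers.Multiply()([image_feat, question_feat])

      # 4. Assign probabilities to each possible answer
      out = layers.Dense(num_answers, activation="softmax")(merged)

      model = tf.keras.Model(inputs=[image_in, question_in], outputs=out)
      model.compile(optimizer="adam", loss="categorical_crossentropy")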

    1. Script mode takes a function/class, reinterprets the Python code and directly outputs the TorchScript IR. This allows it to support arbitrary code, however it essentially needs to reinterpret Python

      Script mode in PyTorch

    2. In 2019, the war for ML frameworks has two remaining main contenders: PyTorch and TensorFlow. My analysis suggests that researchers are abandoning TensorFlow and flocking to PyTorch in droves. Meanwhile in industry, Tensorflow is currently the platform of choice, but that may not be true for long
      • in research: PyTorch > TensorFlow
      • in industry: TensorFlow > PyTorch
    3. Why do researchers love PyTorch?
      • simplicity <--- pythonic like, easily integrates with its ecosystem
      • great API <--- TensorFlow used to switch API many times
      • performance <--- it's not so clear if it's faster than TensorFlow
    4. Researchers care about how fast they can iterate on their research, which is typically on relatively small datasets (datasets that can fit on one machine) and run on <8 GPUs. This is not typically gated heavily by performance considerations, but by their ability to quickly implement new ideas. On the other hand, industry considers performance to be of the utmost priority. While 10% faster runtime means nothing to a researcher, that could directly translate to millions of savings for a company

      Researchers value how fast they can iterate on their research.

      Industry values performance, as it translates directly into money.

    5. From the early academic outputs Caffe and Theano to the massive industry-backed PyTorch and TensorFlow

      It's not easy to track all the ML frameworks

      Caffe, Theano ---> PyTorch, TensorFlow

    6. When you run a PyTorch/TensorFlow model, most of the work isn’t actually being done in the framework itself, but rather by third party kernels. These kernels are often provided by the hardware vendor, and consist of operator libraries that higher-level frameworks can take advantage of. These are things like MKLDNN (for CPU) or cuDNN (for Nvidia GPUs). Higher-level frameworks break their computational graphs into chunks, which can then call these computational libraries. These libraries represent thousands of man hours of effort, and are often optimized for the architecture and application to yield the best performance

      What happens behind when you run ML frameworks

    7. Jax is built by the same people who built the original Autograd, and features both forward- and reverse-mode auto-differentiation. This allows computation of higher order derivatives orders of magnitude faster than what PyTorch/TensorFlow can offer

      Jax

    8. TensorFlow will always have a captive audience within Google/DeepMind, but I wonder whether Google will eventually relax this

      Generally, PyTorch is so much more favoured that maybe one day it will replace TensorFlow even at Google's offices

    9. Once your PyTorch model is in this IR, we gain all the benefits of graph mode. We can deploy PyTorch models in C++ without a Python dependency, or optimize it.

    10. At their core, PyTorch and Tensorflow are auto-differentiation frameworks

      auto-differentiation = taking the derivative of some function. It can be implemented in many ways, so most ML frameworks choose "reverse-mode auto-differentiation" (known as "backpropagation")
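
      A tiny PyTorch illustration of reverse-mode auto-differentiation (the function is arbitrary, chosen only to show the mechanics):

      import torch

      x = torch.tensor(3.0, requires_grad=True)
      y = x ** 2 + 2 * x      # forward pass records the computation graph
      y.backward()            # reverse pass propagates the gradient back to x
      print(x.grad)           # dy/dx = 2*x + 2 = 8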

    11. At the API level, TensorFlow eager mode is essentially identical to PyTorch’s eager mode, originally made popular by Chainer. This gives TensorFlow most of the advantages of PyTorch’s eager mode (ease of use, debuggability, and etc.) However, this also gives TensorFlow the same disadvantages. TensorFlow eager models can’t be exported to a non-Python environment, they can’t be optimized, they can’t run on mobile, etc. This puts TensorFlow in the same position as PyTorch, and they resolve it in essentially the same way - you can trace your code (tf.function) or reinterpret the Python code (Autograph).

      Tensorflow Eager
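
      A minimal sketch of the tracing route mentioned above: tf.function turns an eager Python function into a graph that can then be optimized or exported (the function itself is a made-up example):

      import tensorflow as tf

      @tf.function
      def scaled_sum(a, b):
          # traced into a TensorFlow graph on the first call for this input signature
          return tf.reduce_sum(a * 2.0 + b)

      result = scaled_sum(tf.constant([1.0, 2.0]), tf.constant([3.0, 4.0]))
      print(result.numpy())   # 13.0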

    12. TensorFlow came out years before PyTorch, and industry is slower to adopt new technologies than researchers

      Reason why PyTorch wasn't previously more popular than TensorFlow

    13. The PyTorch JIT is an intermediate representation (IR) for PyTorch called TorchScript. TorchScript is the “graph” representation of PyTorch. You can turn a regular PyTorch model into TorchScript by using either tracing or script mode.

      PyTorch JIT

    14. In 2018, PyTorch was a minority. Now, it is an overwhelming majority, with 69% of CVPR using PyTorch, 75+% of both NAACL and ACL, and 50+% of ICLR and ICML

    15. Tracing takes a function and an input, records the operations that were executed with that input, and constructs the IR. Although straightforward, tracing has its downsides. For example, it can’t capture control flow that didn’t execute. For example, it can’t capture the false block of a conditional if it executed the true block

      Tracing mode in PyTorch
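
      A small sketch contrasting tracing with script mode on a toy module with data-dependent control flow (the module is made up just to show the pitfall):

      import torch

      class Clamp(torch.nn.Module):
          def forward(self, x):
              if x.sum() > 0:        # branch depends on the input values
                  return x
              return -x

      example = torch.ones(3)

      traced = torch.jit.trace(Clamp(), example)   # records only the branch taken for `example`
      scripted = torch.jit.script(Clamp())         # reinterprets the Python code, keeping both branches

      print(traced(-torch.ones(3)))    # tensor([-1., -1., -1.]) : the untaken branch was lost
      print(scripted(-torch.ones(3)))  # tensor([1., 1., 1.])    : the else-branch is preserved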

    16. On the other hand, industry has a litany of restrictions/requirements

      Industry requirements that TensorFlow addresses:

      • no Python <--- overhead of the Python runtime might be too much to take
      • mobile <--- Python can't be embedded in the mobile binary
      • serving <--- no-downtime updates of models, switching between models seamlessly, etc.
    17. TensorFlow is still the dominant framework. For example, based on data [2] [3] from 2018 to 2019, TensorFlow had 1541 new job listings vs. 1437 job listings for PyTorch on public job boards, 3230 new TensorFlow Medium articles vs. 1200 PyTorch, 13.7k new GitHub stars for TensorFlow vs 7.2k for PyTorch, etc

      Nowadays, the numbers still play against PyTorch

    18. every major conference in 2019 has had a majority of papers implemented in PyTorch

      Legend:

      • CVPR, ICCV, ECCV - computer vision conferences
      • NAACL, ACL, EMNLP - NLP conferences
      • ICML, ICLR, NeurIPS - general ML conferences

      Interactive version

      PyTorch vs TensorFlow

    19. the transition from TensorFlow 1.0 to 2.0 will be difficult and provides a natural point for companies to evaluate PyTorch

      Chance of faster transition to PyTorch in industry

    1. For the application of machine learning in finance, it’s still very early days. Some of the stuff people have been doing in finance for a long time is simple machine learning, and some people were using neural networks back in the 80s and 90s.   But now we have a lot more data and a lot more computing power, so with our creativity in machine learning research, “We are so much in the beginning that we can’t even picture where we’re going to be 20 years from now”

      We are just in time to apply modern ML techniques to the financial industry

    2. ability to learn from data e.g. OpenAI and the Rubik’s Cube and DeepMind with AlphaGo required the equivalent of thousands of years of gameplay to achieve those milestones

      Even with a perfect algorithm, we have to expect long hours of training

    3. Pedro’s book “The Master Algorithm” takes readers on a journey through the five dominant paradigms of machine learning research on a quest for the master  algorithm. Along the way, Pedro wanted to abstract away from the mechanics so that a broad audience, from the CXO to the consumer, can understand how machine learning is shaping our lives

      "The Master Algorithm" book seems to be too abstract in such a case; however, it covers the following 5 paradigms:

      • Rule based learning (Decision trees, Random Forests, etc)
      • Connectionism (neural networks, etc)
      • Bayesian (Naive Bayes, Bayesian Networks, Probabilistic Graphical Models)
      • Analogy (KNN & SVMs)
      • Unsupervised Learning (Clustering, dimensionality reduction, etc)
    4. We’ve always lived in a world which we didn’t completely understand but now we’re living in a world designed by us – for Pedro, that’s actually an improvement

      We never really understood the surroundings, but now we have a great impact to modify it

    5. But at the end of the day, what we know about neuroscience today is not enough to determine what we do in AI, it’s only enough to give us ideas.  In fact it’s a two way street – AI can help us to learn how the brain works and this loop between the two disciplines is a very important one and is growing very rapidly

      Neuroscience can help us understand AI, and vice versa

    6. Pedro believes that success will come from unifying the different major types of learning and their master algorithms –not just combining, but unifying them such that “it feels like using one thing”

      Interesting point of view on designing the master algorithm

    7. if you look at the number  of connections that the state of the art machine learning systems for some of these problems have, they’re more than many animals – they have many hundreds of millions or billions of connections

      State of the art ML systems are composed of hundreds of millions or billions of connections (more than many animals)

    8. There was this period of a couple of 100 years where we understood our technology.  Now we just have to learn live in a world where we don’t understand the machines that work  for us, we just have to be confident they are working for us and doing their best

      Should we just accept the fact that machines will rule the world with a mysterious intelligence?

    1. team began its analysis on YouTube 8M, a publicly available dataset of YouTube videos

      YouTube 8M - public dataset of YouTube videos. With this, we can analyse video features like:

      • color
      • illumination
      • many types of faces
      • thousands of objects
      • several landscapes
    2. The trailer release for a new movie is a highly anticipated event that can help predict future success, so it behooves the business to ensure the trailer is hitting the right notes with moviegoers. To achieve this goal, the 20th Century Fox data science team partnered with Google’s Advanced Solutions Lab to create Merlin Video, a computer vision tool that learns dense representations of movie trailers to help predict a specific trailer’s future moviegoing audience

      Merlin Video - computer vision tool to help predict a specific trailer's moviegoing audience

    3. pipeline also includes a distance-based “collaborative filtering” (CF) model and a logistic regression layer that combines all the model outputs together to produce the movie attendance probability

      other elements of pipeline:

      • collaborative filtering (CF) model
      • logistic regression layer
    4. Merlin returns the following labels: facial_hair, beard, screenshot, chin, human, film

      Types of features Merlin Video can generate from a single trailer frame.

      Final result of feature collecting and ordering:

    5. The obvious choice was Cloud Machine Learning Engine (Cloud ML Engine), in conjunction with the TensorFlow deep learning framework

      Merlin Video is powered by:

      • Cloud Machine Learning Engine - automating infrastructure (resources, provisioning and monitoring)
      • TensorFlow
      • Cloud Dataflow and Data Studio - Dataflow generates reports in Data Studio
      • BigQuery and BigQueryML - used in a final step to merge Merlin’s millions of customer predictions with other data sources to create useful reports and to quickly prototype media plans for marketing campaigns
    6. custom model learns the temporal sequencing of labels in the movie trailer

      Temporal sequencing - times of different shots (e.g. long or short).

      Temporal sequencing can convey information on:

      • movie type
      • movie plot
      • roles of the main characters
      • filmmakers' cinematographic choices.

      When combined with historical customer data, sequencing analysis can be used to create predictions of customer behavior.

      arxiv paper on Merlin Video

    7. The elasticity of Cloud ML Engine allowed the data science team to iterate and test quickly, without compromising the integrity of the deep learning model

      Cloud ML Engine reduced the deployment time from months to days

    8. Architecture flow diagram for Merlin

    9. The first challenge is the temporal position of the labels in the trailer: it matters when the labels occur in the trailer. The second challenge is the high dimensionality of this data

      2 challenges in labelling video clips: the temporal position of labels and their high dimensionality

    10. When it comes to movies, analyzing text taken from a script is limiting because it only provides a skeleton of the story, without any of the additional dynamism that can entice an audience to see a movie

      Analysing movie script isn't enough to predict the overall movie's attractiveness to the audience

    11. 20th Century Fox has been using this tool since the release of The Greatest Showman in 2017, and continues to use it to inform their latest releases

      The Merlin Video tool is used nowadays by 20th Century Fox

    12. model is trained end-to-end, and the loss of the logistic regression is back-propagated to all the trainable components (weights). Merlin’s data pipeline is refreshed weekly to account for new trailer releases

      Way the model is trained and located in the pipeline

    13. After a movie’s release, we are able to process the data on which movies were previously seen by that audience. The table below shows the top 20 actual moviegoer audiences (Comp ACT) compared to the top 20 predicted audiences (Comp PRED)

      Way of validating the Merlin model

  16. Oct 2018
  17. Aug 2017
    1. introduce topic modeling to those not yet fully converted aware of its potential.

      Is resistance futile?

    2. They’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination.
    1. In writing the description of our reverse engineering work below, we deliberately avoid terms that are commonly used in Machine Learning, where labels are "true," "correct," or "gold standard." This linguistic distinction highlights the fundamentally different perspective that humanists have on classification as a tool. Our goal is not to create a system that mimics the decisions of a human annotator, but rather to better represent the porous boundaries between labels and identify the piles on which a story could have been placed over a century ago late on a cold wintry night in a dimly lit schoolhouse in eastern Jutland. We note the contrast between our use of computers to problematize existing distinctions and the common concern in the Humanities that computers deal only with binaries and black-and-white distinctions.

      Valuable insight and eloquently phrased.

    2. Our goal is not to treat existing classifications as "ground truth" labels and build machine learning tools to mimic them, but rather to use computation to better quantify the variability and uncertainty of those classifications.
    3. Our goal with this classification method can be seen as the inverse of usual machine learning classifiers.
  18. May 2014