95 Matching Annotations
  1. Nov 2017
    1. In Bash you quite often need to check to see if a variable has been set or has a value other than an empty string. This can be done using the -n or -z string comparison operators.

      Two most useful commands in bash

    1. select top 1 * from newsletters where IsActive = 1 order by PublishDate desc

      This doesn't require a full table scan or a join operation. That's just COOL

    1. // The search space within the array is changing for each round - but the list // is still the same size. Thus, k does not need to be updated with each round.

      If you do not do this, you have to keep re-basing the kth index on every round to match the size of the shrinking sub-array.

      if (pivotIdx < k - 1): quickSelect(A, pivotIdx + 1, end, k-1 - (pivotIdx + 1));
      if (pivotIdx > k - 1): quickSelect(A, start, pivotIdx - 1, k - 1);
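
      A minimal sketch of the idea in the highlighted comment, assuming a 1-based k and a Lomuto partition with a random pivot (both are my choices, not taken from the source). Because start and end are absolute indices into the same list, k is never adjusted; only the search window moves.

```python
import random


def partition(a, start, end):
    """Lomuto partition of a[start..end]; returns the pivot's final index."""
    pivot_pos = random.randint(start, end)
    a[pivot_pos], a[end] = a[end], a[pivot_pos]
    pivot = a[end]
    store = start
    for i in range(start, end):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[end] = a[end], a[store]
    return store


def quick_select(a, k):
    """Return the k-th smallest element of a (k is 1-based); reorders a in place."""
    if not 1 <= k <= len(a):
        raise ValueError("k out of range")
    start, end = 0, len(a) - 1
    while True:
        pivot_idx = partition(a, start, end)
        if pivot_idx == k - 1:
            return a[pivot_idx]
        if pivot_idx < k - 1:
            start = pivot_idx + 1   # answer lies to the right; k stays the same
        else:
            end = pivot_idx - 1     # answer lies to the left; k stays the same


data = [7, 2, 9, 4, 1, 8]
print(quick_select(list(data), 3))  # prints 4
```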

    1. an hour

      "Hour" is pronounced like "our", so you use the article "an" and not "a".

    2. A European (again we see the y sound coming from a vowel)

      Where would you use "a" and where "an"?

    1. They have a very simplistic view of the activity being monitored by only distilling it down into only a few dimensions for the rule to interrogate

      The number of dimensions needs to be large. In normal database systems the number of dimensions is small.

    2. UML automatically finds these hidden patterns to link seemingly unrelated accounts and customers. These links can be one of thousands of data fields that the UML model ingests.

      Why does this have to be done in a different system?

    1. SubscribePattern allows you to use a regex to specify topics of interest

      This can remove the need to reload the Kafka writers in order to consume messages.

      regex - "topic-ua-.*"
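
      A small sketch of why the pattern syntax matters, using Python's re module as a stand-in for the Java regex engine (the two agree on this construct). The topic names are made up, and I am assuming the subscription requires the pattern to match the whole topic name. Written glob-style, "topic-ua-*" means "topic-ua" followed by zero or more hyphens; "topic-ua-.*" is the form that matches any suffix.

```python
import re

# Hypothetical topic names, for illustration only.
topics = ["topic-ua-clicks", "topic-ua-views", "topic-ua", "topic-eu-clicks"]

glob_style = re.compile(r"topic-ua-*")    # "-*" means zero or more hyphens
regex_style = re.compile(r"topic-ua-.*")  # ".*" means any run of characters


def full_matches(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if pattern.fullmatch(name)]


print(full_matches(glob_style, topics))   # ['topic-ua']
print(full_matches(regex_style, topics))  # ['topic-ua-clicks', 'topic-ua-views']
```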

    2. The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via spark.streaming.kafka.consumer.cache.maxCapacity.

      You might need this for keeping track of all partitions consumed.

  2. Sep 2017
    1. Amazon integrated customer data and payment information with e-book distribution and its Amazon publishing initiative

      Customer data (big data) + payment info (where's the money) + e-book distribution (infrastructure: kindle store and kindle device's seamless integration)

      Earlier guys integrated: Procurement (writers initial draft) + editing + marketing + distribution = Think book reviews and author tours on talk shows.

      Amazon's idea is more insightful and focussed on individual customers and not shooting in the DARK :)

    1. This is one of the few reasons I like to use vim's mouse mode. If you use the GUI version, or your terminal supports sending drag events (such as xterm or rxvt-unicode), you can click on the split line and drag to resize the window exactly where you want, without a lot of guesswork using the ctrl-w plus, minus, less, greater combinations. In terminal versions, you have to set mouse mode properly for this to work, :set mouse=n (I use 'n', but 'a' also works), and you have to set the tty mouse type, :set ttymouse=xterm2. A lot of people say that a lot of time is wasted using the mouse (mostly due to the time it takes to move your hand from the keyboard to the mouse and back), but I find that, in this case, the time saved by having immediate feedback while adjusting window sizes and the quickness of re-resizing (keep moving the mouse instead of typing another key sequence) outweighs the delay of moving my hand.

      Simply amazing setup for working with multiple split windows in vim.

    1. "What we think about is that there's a conversation, and inside of that conversation you have a contextual place where you can have all of the interactions that you want or have or need to have with a brand or service, and it can take multiple forms. It can be buttons, it can be UI [user interface] and it can be conversational when it needs to be," Marcus said.

      This is exactly what in-context payments means.

  3. Aug 2017
    1. “throwing seniors under the bus” (to use the words of one senator) by keeping interest rates low.

      Since interest rates are low, seniors who are trying to live off of their savings will be in trouble because their money doesn't earn anything for them.

    2. real interest rates are determined by a wide range of economic factors, including prospects for economic growth—not by the Fed.

      Real interest rate = nominal interest rate - inflation

  4. Jul 2017
    1. In distributed mode, you start many worker processes using the same group.id and they automatically coordinate to schedule execution of connectors and tasks across all available workers.

      Distributed workers.

      group.id = "SHOULD BE THE SAME FOR ALL WORKERS"

    2. Connectors and tasks are logical units of work and must be scheduled to execute in a process. Kafka Connect calls these processes workers and has two types of workers: standalone and distributed.

      Workers = JVM processes

    1. When you are starting your kafka broker you can define a set of properties in the conf/server.properties file. This file is just a key-value property file. One of the properties is auto.create.topics.enable; if it is set to true (the default), kafka will create a topic automatically when you send a message to a non-existing topic. All config options you can find here. Imho, a simple rule for creating topics is the following: the number of replicas must be not less than the number of nodes that you have. The number of topics must be a multiple of the number of nodes in your cluster. For example: you have a 9 node cluster, so your topic must have 9 partitions and 9 replicas, or 18 partitions and 9 replicas, or 36 partitions and 9 replicas, and so on.

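      #replicas = number of replicas, #nodes = number of nodes, #partitions = number of partitions per topic

      #replicas >= #nodes

      #partitions = k x #nodes (the answer says "topics", but its example counts partitions: 9, 18, 36 for 9 nodes)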

    1. The students who exerted more self-control were not more successful in accomplishing their goals. It was the students who experienced fewer temptations overall who were more successful when the researchers checked back in at the end of the semester.

      Reduce the number of distractions and you get better results.

    1. ab -t 15 -k -T "application/vnd.kafka.binary.v1+json" -p postfile http://localhost:8082/topics/test

      ab benchmark

    1. Owning stock gives you the right to vote in shareholder meetings, receive dividends (which are the company’s profits) if and when they are distributed, and it gives you the right to sell your shares to somebody else.

      Dividends are profits shared amongst shareholders.

  5. Jun 2017
    1. in sync replicas (ISRs) should be exactly equal to the total number of replicas.

      ISRs are a very important metric.

    2. Kafka metrics can be broken down into three categories: Kafka server (broker) metrics, producer metrics, and consumer metrics.

      3 Metrics:

      • Broker
      • Producer (Netty)
      • Consumer (SECOR)
    1. "isr" is the set of "in-sync" replicas.

      ISRs are pretty important, as when nodes go down you will see replicas created later.

    1. public void increment(String label, String topic) { + Stats.incr(label); + }

      import io.prometheus.client.Counter;

      // Both members below belong inside the class that implements MetricCollector.
      public static final Counter requests = Counter.build().name("requests_total").help("Total requests.").register();

      public void increment(String label, String topic) {
          requests.inc(); // counts every call; label and topic are ignored here
      }

    2. The component is pluggable, you can configure ingestion of metrics to any monitoring system by implementing MetricCollector interface. By default metrics are sent to Ostrich, statistic library bundled into secor.

      Just inject prometheus simple client into this module and you could get prometheus working for secor.

    1. Shannon held up a BBC micro:bit board, which runs MicroPython and has been given to students in the UK, and noted that it only has 16KB of memory.

      check out MicroPython

    1. You measure the throughput that you can achieve on a single partition for production (call it p) and consumption (call it c). Let’s say your target throughput is t.

      t = target throughput (QPS)
      p = throughput achievable on a single partition for production
      c = throughput achievable on a single partition for consumption
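
      A rough sketch of how these three numbers are usually combined, assuming the common rule of thumb that you need at least max(t/p, t/c) partitions (the formula is not part of the highlight above, and the function name is mine). All three values must be in the same unit.

```python
import math


def min_partitions(target, per_partition_produce, per_partition_consume):
    """Smallest partition count that meets the target on both the produce and consume side."""
    if min(target, per_partition_produce, per_partition_consume) <= 0:
        raise ValueError("all throughputs must be positive")
    needed_for_produce = math.ceil(target / per_partition_produce)
    needed_for_consume = math.ceil(target / per_partition_consume)
    return max(needed_for_produce, needed_for_consume)


# Example with made-up numbers: t = 100, p = 10, c = 20 (same unit for all three).
print(min_partitions(100, 10, 20))  # prints 10
```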

    1. Messages are immediately written to the filesystem when they are received. Messages are not deleted when they are read but retained with some configurable SLA (say a few days or a week)
    1. ZooKeeper snapshots can be one such a source of concurrent writes, and ideally should be written on a disk group separate from the transaction log.

      Keep ZooKeeper snapshots on a separate disk group from the transaction log so that snapshot writes do not compete with log writes.

    2. If you do end up sharing the ensemble, you might want to use the chroot feature. With chroot, you give each application its own namespace.

      jail zookeeper instance from the other apps

    1. In merced, we used the low-level simple consumer and wrote our own work dispatcher to get precise control.

      difference between merced and secor

    1. A better alternative is at least once message delivery. For at least once delivery, the consumer reads data from a partition, processes the message, and then commits the offset of the message it has processed. In this case, the consumer could crash between processing the message and committing the offset and when the consumer restarts it will process the message again. This leads to duplicate messages in downstream systems but no data loss.

      This is what SECOR does.
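
      A toy simulation of the read, process, commit order described above, with no real Kafka client involved (the function and parameter names are hypothetical). A crash between processing and committing means the record is processed again after restart: a duplicate downstream, but nothing lost.

```python
def consume_at_least_once(log, committed_offset, sink, crash_before_commit_at=None):
    """Process records starting at committed_offset and return the new committed offset.

    If crash_before_commit_at is set, the consumer "crashes" after processing the
    record at that offset but before committing it.
    """
    offset = committed_offset
    while offset < len(log):
        sink.append(log[offset])           # 1. process the message
        if offset == crash_before_commit_at:
            return committed_offset        # crash: the commit never happens
        committed_offset = offset + 1      # 2. commit the offset
        offset += 1
    return committed_offset


log = ["a", "b", "c"]
sink = []
committed = consume_at_least_once(log, 0, sink, crash_before_commit_at=1)
committed = consume_at_least_once(log, committed, sink)  # restart from last commit
print(sink)  # ['a', 'b', 'b', 'c']
```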

    2. no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.

      Retries should be built into the consumer and producer code. If the leader for the partition fails, you will see a LeaderNotAvailableException.

    3. By electing a new leader as soon as possible messages may be dropped but we will minimize downtime as any new machine can be leader.

      Two ways to get a leader back: 1) wait for the original leader (master) to come back online, or 2) elect the first replica that comes back up. In the second case, if that replica was behind the master, all the data written between the moment the replica went down and the moment the master went down is lost.

      So there is a trade-off between availability and consistency (durability).

    4. keep in mind that these guarantees hold as long as you are producing to one partition and consuming from one partition.

      This is very important: a 1-to-1 mapping between writer and reader per partition. If you have more producers per partition or more consumers per partition, your consistency is going to go haywire.

    1. On every received heartbeat, the coordinator starts (or resets) a timer. If no heartbeat is received when the timer expires, the coordinator marks the member dead and signals the rest of the group that they should rejoin so that partitions can be reassigned. The duration of the timer is known as the session timeout and is configured on the client with the setting session.timeout.ms. 

      Time to live for the consumers. If a heartbeat doesn't reach the coordinator within this duration, the coordinator redistributes the partitions to the remaining consumers in the consumer group.
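
      A simplified model of the timer logic in the highlight, not the real coordinator code (the class and method names are hypothetical). Each heartbeat resets a member's timer; a member whose last heartbeat is older than the session timeout is dropped, which is the point where the group would rebalance.

```python
class GroupCoordinator:
    """Tracks the last heartbeat per member and expires members after a session timeout."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.last_heartbeat_ms = {}

    def heartbeat(self, member, now_ms):
        # Starting or resetting the timer is just recording the latest heartbeat time.
        self.last_heartbeat_ms[member] = now_ms

    def expire_dead_members(self, now_ms):
        """Remove and return members whose timer has run out."""
        dead = sorted(
            member
            for member, seen_ms in self.last_heartbeat_ms.items()
            if now_ms - seen_ms > self.session_timeout_ms
        )
        for member in dead:
            del self.last_heartbeat_ms[member]
        return dead


coordinator = GroupCoordinator(session_timeout_ms=10_000)
coordinator.heartbeat("c1", now_ms=0)
coordinator.heartbeat("c2", now_ms=0)
coordinator.heartbeat("c1", now_ms=8_000)              # c1 keeps checking in, c2 goes silent
print(coordinator.expire_dead_members(now_ms=12_000))  # ['c2']
```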

    2. The high watermark is the offset of the last message that was successfully copied to all of the log’s replicas.

      High Watermark: messages copied over to log replicas
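
      A tiny sketch of that definition, assuming each replica simply reports the offset of the last message it has stored (the replica names and offsets are made up). The high watermark is then the smallest of those offsets, because everything up to it exists on every replica.

```python
def high_watermark(last_offset_by_replica):
    """Offset of the last message that every replica has stored."""
    if not last_offset_by_replica:
        raise ValueError("at least one replica is required")
    return min(last_offset_by_replica.values())


# Hypothetical state: the leader is ahead of both followers.
replicas = {"leader": 9, "follower-1": 7, "follower-2": 8}
print(high_watermark(replicas))  # prints 7
```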

    3. Kafka new Client which uses a different protocol for consumption in a distributed environment.

    4. Kafka scales topic consumption by distributing partitions among a consumer group, which is a set of consumers sharing a common group identifier.

      Topic consumption is distributed among the consumers in a consumer group.

    1. In a recent analysis of more than 500 billion events collected from multiple global online services, 18% of user accounts that originated from cloud service IP ranges were fraudulent.

      Really cool statistics.

    1. Kafka consumer offset management protocol to keep track of what’s been uploaded to S3

      consumers keep track of what's written and where it left off by looking at kafka consumer offsets rather than checking S3 since S3 is an eventually consistent system.

    2. Data lost or corrupted at this stage isn’t recoverable so the greatest design objective for Secor is data integrity.

      data loss in S3 is being mitigated.

    1. Replication is important for two primary reasons:
      • HA with fail-over mechanism.
      • scale out your search volume/throughput as searches can happen based on your replicas in parallel.
    2. Sharding is important for two primary reasons:
      • horizontal scaling of a single index
      • parallelize seek operations on multiple shards when index gets too big
    3. Elasticsearch provides the ability to subdivide your index into multiple pieces called shards.

      Think of shards as an index broken down further to span multiple nodes in your cluster.

    4. An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

      An index may outgrow the disk space of a single node. Hence you want to get the most out of your instances by splitting the index across nodes.

    1. The create index API allows to instantiate an index. Elasticsearch provides support for multiple indices, including executing operations across several indices.

      With this you can set a different number of shards per index in Elasticsearch. Super useful when you have a single cluster that is multi-tenant.

    1. incidents are an unavoidable reality of working with distributed systems, no matter how reliable. A prompt alerting solution should be an integral part of the design,

      see how it can hook into the current logging mechanism

    2. Consumers in this group are designed to be dead-simple, performant, and highly resilient. Since the data copied verbatim, no code upgrades are required to support new message types.

      exactly what we want

  6. May 2017
    1. tar -zxvf target/secor-0.1-SNAPSHOT-bin.tar.gz -C ${SECOR_INSTALL_DIR}

      Linux command to extract a tarball into a different directory (the -C option). Might come in handy.

    1. With Flume & FlumeNG, and a File channel, if you loose a broker node you will lose access to those events until you recover that disk.

      In Flume you lose events if the disk is down. This is very bad for our use case.

    1. Optimum buffer size is related to a number of things: file system block size, CPU cache size and cache latency. Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (i.e. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, the you pay the price of the disk->RAM latency as well. This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads. Now, this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read (we are doing sequential reads here, after all) - so you wind up paying the RAM -> L3/L2 cache latency price on the next read, but not the disk->RAM latency. In terms of order of magnitude, disk->RAM latency is so slow that it pretty much swamps any other latency you might be dealing with. So, I suspect that if you ran a test with different cache sizes (haven't done this myself), you will probably find a big impact of cache size up to the size of the file system block. Above that, I suspect that things would level out pretty quickly. There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering (just getting a handle on L3 -> L2 cache transfers is mind bogglingly complex, and it changes with every CPU type). 
This leads to the 'real world' answer: If your app is like 99% out there, set the cache size to 8192 and move on (even better, choose encapsulation over performance and use BufferedInputStream to hide the details). If you are in the 1% of apps that are highly dependent on disk throughput, craft your implementation so you can swap out different disk interaction strategies, and provide the knobs and dials to allow your users to test and optimize (or come up with some self optimizing system).

      What buffer size should you use when reading from a file into a buffer?
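
      A minimal sketch of the "set it to 8192 and move on" advice, written in Python rather than the Java the answer has in mind. The file is opened unbuffered so that the explicit chunk size is the one actually used; the function name and the constant are mine.

```python
BUFFER_SIZE = 8192  # a power of two, at least as large as a typical 4096 or 8192 byte disk block


def count_bytes(path, buffer_size=BUFFER_SIZE):
    """Read a file in fixed-size chunks and return its length in bytes."""
    total = 0
    # buffering=0 gives a raw, unbuffered file object, so each read asks the OS
    # for at most buffer_size bytes.
    with open(path, "rb", buffering=0) as raw:
        while True:
            chunk = raw.read(buffer_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

      In everyday code the default buffered open(path, "rb") hides these details, which is the Python counterpart of the answer's suggestion to use BufferedInputStream.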

    1. The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

      Irrespective of whether a consumer has consumed a message, that message is kept in Kafka for the entire retention period.

      You can have two or more consumer groups: 1 -> real time 2 -> back up consumer group

    2. Kafka for Stream Processing

      Could be something we can consider for directing data from a raw log to a tenant based topic.

    3. replication factor N, we will tolerate up to N-1 server failures without losing any records

      The replication factor N is the number of copies kept of each partition, so N-1 of the brokers holding those copies can go down before we start losing data.

      So if you have a replication factor of 6 on an 11-node cluster, a partition survives up to 5 failures among the brokers holding its replicas. Once the sixth replica is gone, you lose data for that partition.
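
      A small sketch of that arithmetic, assuming the 11-broker, replication-factor-6 example from the note (the broker ids and the function name are made up). A partition's data survives as long as at least one broker holding a replica is still up.

```python
def partition_survives(replica_brokers, failed_brokers):
    """True if at least one broker holding a replica of the partition is still up."""
    return any(broker not in failed_brokers for broker in replica_brokers)


# Hypothetical placement: the partition's 6 replicas live on brokers 1 to 6 of an 11-broker cluster.
replica_brokers = {1, 2, 3, 4, 5, 6}

print(partition_survives(replica_brokers, {1, 2, 3, 4, 5}))      # True: N-1 = 5 failures tolerated
print(partition_survives(replica_brokers, {1, 2, 3, 4, 5, 6}))   # False: all 6 replicas are gone
print(partition_survives(replica_brokers, {7, 8, 9, 10, 11}))    # True: no replica was affected
```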

    4. Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.

      ordering is guaranteed.

    5. Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

      kafka takes care of the consumer groups. Just create one Consumer Group for each topic.

    6. The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions.

      partitions of logs are per TOPIC basis

    1. The first limitation is that each partition is physically represented as a directory of one or more segment files. So you will have at least one directory and several files per partition. Depending on your operating system and filesystem this will eventually become painful. However this is a per-node limit and is easily avoided by just adding more total nodes in the cluster.

      total number of topics supported depends on the total number of partitions per topic.

      partition = directory of 1 or more segment files. This is a per-node limit.

    1. the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions.

      This is an important thing to keep in mind. If we need massive parallelism we need to have more partitions.

    1. The offset gives the ordering of messages as an immutable sequence. Kafka maintains this message ordering for you.

      Kafka maintains the ordering for you...

    1. So to a start up founder, timing correctly when to do something is far more important than doing that thing amazingly well

      Words of wisdom - The act of doing should go hand in hand with the act of figuring out when to do something.

    2. "I don't have a system and need to build one."

      It's a POC - GSD and most importantly see what sticks.

    1. ($20*3)-($20*3*.1) = $54

      1% of $2000 (cost of the camera) = $20 per day; $20 * 3 days = $60 rental price

      Rental price - commission = rental income. This guy totally forgot taxes here.... :)

      $54 for 3 days. At about 50% usage over a 365-day year, that is roughly 180 rental days, or 60 three-day rentals: 60 * $54 = $3240 per year. That is about $1240 profit in the first year on a $2000 investment, if he's 50% utilized over the year.

      Cameras! Man, this guy needed to crunch some more numbers. Cameras have compatibility issues....
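
      The back-of-the-envelope numbers, worked through step by step. The $20 daily rate and 10% commission come from the highlighted formula; the $2000 camera price and the roughly 180 rented days per year are the assumptions from the note. Taxes are still ignored.

```python
DAILY_RATE = 20              # dollars per day, from the highlighted formula
RENTAL_DAYS = 3              # length of one rental
COMMISSION_PERCENT = 10      # platform commission
CAMERA_COST = 2000           # dollars, assumption from the note
RENTED_DAYS_PER_YEAR = 180   # about 50% utilisation of 365 days, rounded as in the note

gross_per_rental = DAILY_RATE * RENTAL_DAYS                             # 60
net_per_rental = gross_per_rental * (100 - COMMISSION_PERCENT) // 100   # 54
rentals_per_year = RENTED_DAYS_PER_YEAR // RENTAL_DAYS                  # 60
net_per_year = net_per_rental * rentals_per_year                        # 3240
first_year_profit = net_per_year - CAMERA_COST                          # 1240

print(net_per_rental, net_per_year, first_year_profit)  # prints 54 3240 1240
```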

    2. We should have built the absolute minimum and then manually maintained transactions through the backend if need be. To hell with formality. We were in the business of proving a hypothesis yet we acted as if it had already been proven.

      quick and dirty solution and see if it sticks...

    1. Listerine, for example, started life on the shelf as an antiseptic, sold as both floor cleaner and a treatment for gonorrhoea. But it wasn't a runaway success until it was marketed as a treatment for bad breath. 

      Interesting use-case

    2. But many ideas are destined for improvement

      you may start with something and end up with something totally different

      1. Brainstorm.
      2. Break the idea and fix idea. (Scrutinize)
      3. Permeate and let the idea seep into you.
      4. Execute - execute - execute
    3. There are many people with good ideas who don't have the means, the will, or the courage to action them. Similarly, there are very talented business people who have no ideas, but are brilliant at the execution.

      figure out what you want to be and continue on that path, get better at it, and invest time and effort into it.

    1. volume, velocity, and variety

      volume: The actual size of traffic

      Velocity: How fast does the traffic show up.

      Variety: Refers to data that can be unstructured, semi structured or multi structured.

    1. replication-factor 3

      With replication factor 3 you can tolerate n-1 = 2 nodes going down. You only start losing data for a partition once all 3 of its replicas are down.

    2. For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

      E.g. a cluster has 11 brokers/servers and the topic's replication factor is 6. The topic will start losing data for a partition if more than 5 of the brokers holding that partition's replicas go down.

    3. The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

      The coolest feature: this way all you need to do is add new consumers in a consumer group to auto scale per topic
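
      A toy model of how partitions get spread over a consumer group and reshuffled when membership changes. Round-robin is used here purely for illustration; it is not claimed to be the assignment strategy Kafka picks by default, and the function name is hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin so each partition has exactly one owner."""
    members = sorted(consumers)
    if not members:
        raise ValueError("the group has no consumers")
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


partitions = range(6)
print(assign_partitions(partitions, ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}

# A new consumer joins: it takes over some partitions from the existing members.
print(assign_partitions(partitions, ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# c2 dies: its partitions are redistributed to the remaining members.
print(assign_partitions(partitions, ["c1", "c3"]))
# {'c1': [0, 2, 4], 'c3': [1, 3, 5]}
```

      Note that adding consumers beyond the number of partitions leaves the extra consumers idle in this model, which matches the "no more parallelism than partitions" rule noted elsewhere in these annotations.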

    4. Consumers label themselves with a consumer group name

      maintain separate consumer group per tenant basis. Helps to scale out when we have more load per tenant.

    5. The producer is responsible for choosing which record to assign to which partition within the topic.

      The producer can publish to a specific partition within a topic.

    6. individual partition must fit on the servers that host it

      Each Partition is bounded by the server that hosts that partition.

    7. the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes.

      The per-consumer offset is the only metadata Kafka keeps for a consumer. Because the offset is stored, nothing breaks if the consumer goes down: it can resume from where it left off.

    8. the retention policy is set to two days, then for the two days after a record is published,

      Might have to tweak this based on the persistence level we want to keep.

    1. Executing the above code results in a TimeoutException:

      For some reason this did not work for me. No idea why?

    1. The key idea of spark is Resilient Distributed Datasets (RDD); it supports in-memory processing computation.

      RDD is a way in which spark stores datasets in-memory

    1. But ideally, a client should have to know a single URI only; everything else – individual URIs, as well as recipes for constructing them e.g. in case of queries – should be communicated via hypermedia, as links within resource representations.

      Well, something like http://www.example.com/user-ids/1234. This makes each of the user IDs a separate URI and hence accessible.

    2. REST simply means using HTTP to expose some application functionality. The fundamental and most important operation (strictly speaking, “verb” or “method” would be a better term) is an HTTP GET.

      Almost everybody is going to come into the RESTful world with this mindset. I started here and still most times fall back to this thought process when things start to hurt my head. :)

    3. HTTP fixes them at GET, PUT, POST and DELETE (primarily, at least), and casting all of your application semantics into just these four verbs takes some getting used to. But once you’ve done that, people start using a subset of what actually makes up REST – a sort of Web-based CRUD (Create, Read, Update, Delete) architecture. Applications that expose this anti-pattern are not really “unRESTful” (if there even is such a thing), they just fail to exploit another of REST’s core concepts: hypermedia as the engine of application state.

      This thought process is pretty hard to get used to initially especially with distributed systems.