Coding Blocks

Allen Underwood, Michael Outlaw, Joe Zack

Podcast about computer programming and software development so you can learn on the go.

  • Nuts and Bolts of Apache Kafka

    Topics, Partitions, and APIs oh my! This episode we’re getting further into how Apache Kafka works and its use cases. Also, Allen is staying dry, Joe goes for broke, and Michael (eventually) gets on the right page.

    The full show notes are available on the website at https://www.codingblocks.net/episode236

    News

    • Thanks for the reviews! angingjellies and Nick Brooker
      • Please leave us a review! (/review)
    • Atlanta Dev Con is coming up, on September 7th, 2024 (www.atldevcon.com)

    Kafka Topics

    • They are partitioned – this means they are distributed (or can be) across multiple Kafka brokers into “buckets”
    • New events written to Kafka are appended to partitions
      • The distribution of data across brokers is what allows Kafka to scale so well as data can be written to and read from many brokers simultaneously
    • Events with the same key are written to the same partition as the original event
      • Kafka guarantees reads of events within a partition are always read in the order that they were written
    • For fault tolerance and high availability, topics can be replicated…even across regions and data centers
      • NOTE: If you’re using a cloud provider, know that this can be very costly as you pay for inbound and outbound traffic across regions and availability zones
      • A typical production setup uses a replication factor of 3
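
    The key-to-partition rule above can be sketched in a few lines of Python. This is a minimal illustration, not Kafka's implementation: the partition count, the `partition_for` and `append` helpers, and the CRC32 hash are all assumptions made for the example (Kafka's default partitioner uses its own hash function).

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for the topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. CRC32 is a stand-in hash chosen for this sketch."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(topic: dict, key: str, value: str) -> int:
    """Append an event to the partition chosen by its key and return that partition."""
    partition = partition_for(key)
    topic[partition].append((key, value))
    return partition


topic = defaultdict(list)  # partition number -> ordered list of events
for key, value in [("user-1", "login"), ("user-2", "login"),
                   ("user-1", "search"), ("user-1", "logout")]:
    append(topic, key, value)

# Every event for user-1 sits in one partition, in the order it was written.
user1_partition = partition_for("user-1")
user1_events = [v for k, v in topic[user1_partition] if k == "user-1"]
print(user1_partition, user1_events)
```

    Because the partition depends only on the key, ordering is preserved per key, while different keys can still spread across brokers.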

    Kafka APIs

    • Admin API – used for managing and inspecting topics, brokers, and other Kafka objects
    • Producer API – used to write events to Kafka topics
    • Consumer API – used to read data from Kafka topics
    • Kafka Streams API – the ability to implement stream processing applications/microservices. Some of the key functionality includes functions for transformations, stateful operations like aggregations, joins, windowing, and more
      • In the Kafka streams world, these transformations and aggregations are typically written to other topics (in from one topic, out to one or more other topics)
    • Kafka Connect API – allows for the use of reusable import and export connectors that usually connect external systems. These connectors allow you to gather data from an external system (like a database using CDC) and write that data to Kafka. Then you could have another connector that could push that data to another system OR it could be used for transforming data in your streams application
      • These connectors are referred to as Sources and Sinks in the connector portfolio (confluent.io)
      • Source – gets data from an external system and writes it to a Kafka topic
      • Sink – pushes data to an external system from a Kafka topic
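
    The "in from one topic, out to another topic" pattern can be illustrated without a broker. The sketch below is a toy model under stated assumptions: topics are plain Python lists, and the names `page_views`, `views_per_user`, and `count_views` are hypothetical. It is not the Kafka Streams API, only the shape of a stateful aggregation.

```python
from collections import Counter

# Hypothetical topics, modeled as plain lists of (key, value) events.
page_views = [("alice", "/home"), ("bob", "/pricing"), ("alice", "/docs")]
views_per_user = []  # output topic


def count_views(input_topic, output_topic):
    """Consume every event from the input topic and publish a running count per key."""
    counts = Counter()
    for user, _page in input_topic:
        counts[user] += 1
        output_topic.append((user, counts[user]))
    return counts


totals = count_views(page_views, views_per_user)
print(views_per_user)
```

    The input topic is left untouched, so other processors could read the same events and derive different output topics.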

    Use Cases

    • Message queue – usually talking about replacing something like ActiveMQ or RabbitMQ
      • Message brokers are often used for responsive types of processing, decoupling systems, etc. – Kafka is usually a great alternative that scales, generally has faster throughput, and offers more functionality
    • Website activity tracking – this was one of the very first use cases for Kafka – the ability to rebuild user actions by recording all the user activities as events
      • How and why Kafka was developed (LinkedIn)
      • Typically different activity types would be written to different topics – like web page interactions to one topic and searches to another
    • Metrics – aggregating statistics from distributed applications
    • Log aggregation – some use Kafka for storage of event logs rather than using something like HDFS or a file server or cloud storage – but why? Because using Kafka for the event storage abstracts away the events from the files
    • Stream processing – taking events in and further enriching those events and publishing them to new topics
    • Event sourcing – using Kafka to store state changes from an application that are used to replay the current state of an object or system
    • Commit log – using Kafka as an external commit log is a way to synchronize data between distributed systems, or to help rebuild the state of a failed system

    Tip of the Week

    • Rémi Gallego is a music producer who makes music under a variety of names like The Algorithm and Boucle Infini; almost all of it is instrumental Synthwave with a hard-rock edge. They also make a lot of video game music, including 2 of my favorite game soundtracks of all time, “The Last Spell” and “Hell is for Demons” (YouTube)
    • Did you know that the Kubernetes-focused TUI we’ve raved about before can be used to look up information about other things as well, like :helm and :events? Events is particularly useful for figuring out mysteries. You can see all the “resources” available to you with “?”. You might be surprised at everything you see (pop-eye, x-ray, and monitoring)
    • WarpStream is an S3 backed, API compliant Kafka Alternative. Thanks MikeRg! (warpstream.com)
    • Cloudflare’s trillion message Kafka setup, thanks Mikerg! (blog.bytebytego.com)
    • Want the power and flexibility of jq, but for yaml? Try yq! (gitbook.io)
    • Zenith is terminal graphical metrics for your *nix system written in Rust, thanks MikeRg! (github.com)
    • 8 Big (O)Notation Every Developer should Know (medium.com)
    • Another Git cheat sheet (wizardzines.com)
    9 June 2024, 10:55 pm
  • Intro to Apache Kafka

    We finally start talking about Apache Kafka! Also, Allen is getting acquainted with Aesop, Outlaw is killing clusters, and Joe was paying attention in drama class.

    The full show notes are available on the website at https://www.codingblocks.net/episode235

    News

    Intro to Apache Kafka

    What is it?

    Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

    Core capabilities

    • High throughput – Deliver messages at network-limited throughput using a cluster of machines with latencies as low as 2ms.
    • Scalable – Scale production clusters up to a thousand brokers, trillions of messages per day, petabytes of data, and hundreds of thousands of partitions. Elastically expand and contract storage and processing
    • Permanent storage – Store streams of data safely in a distributed, durable, fault-tolerant cluster.
    • High availability – Stretch clusters efficiently over availability zones or connect separate clusters across geographic regions.

    Ecosystem

    • Built-in stream processing – Process streams of events with joins, aggregations, filters, transformations, and more, using event-time and exactly-once processing.
    • Connect to almost anything – Kafka’s out-of-the-box Connect interface integrates with hundreds of event sources and event sinks including Postgres, JMS, Elasticsearch, AWS S3, and more.
    • Client libraries – Read, write, and process streams of events in a vast array of programming languages
    • Large ecosystem of open source tools – Leverage a vast array of community-driven tooling.

    Trust and Ease of Use

    • Mission critical – Support mission-critical use cases with guaranteed ordering, zero message loss, and efficient exactly-once processing.
    • Trusted by thousands of organizations – Thousands of organizations use Kafka, from internet giants to car manufacturers to stock exchanges. More than 5 million unique lifetime downloads.
    • Vast user community – Kafka is one of the five most active projects of the Apache Software Foundation, with hundreds of meetups around the world.

    What is event streaming?

    • Getting data in real-time from event sources like databases, sensors, mobile devices, cloud services, applications, etc. in the form of streams of events. Those events are stored “durably” (in Kafka) for processing, either in real-time or retrospectively, and then routed to various destinations depending on your needs. It’s this continuous flow and processing of data that is known as “streaming data”

    How can it be used? (some examples)

    • Processing payments and financial transactions in real-time
    • Tracking automobiles and shipments in real time for logistical purposes
    • Capture and analyze sensor data from IoT devices or other equipment
    • To connect and share data from different divisions in a company

    Apache Kafka as an event streaming platform?

    • It contains three key capabilities that make it a complete streaming platform
      • Can publish and subscribe to streams of events
      • Can store streams of events durably and reliably for as long as necessary (infinitely if you have the storage)
      • To process streams of events in real-time or retrospectively
    • Can be deployed to bare metal, virtual machines or to containers on-prem or in the cloud
    • Can be run self-managed or via various cloud providers as a managed service

    How does Kafka work?

    • A distributed system that’s composed of servers and clients that communicate using a highly performant TCP protocol

    Servers

    • Kafka runs as a cluster of one or more servers that can span multiple data centers or cloud regions
    • Brokers – these are a portion of the servers that are the storage layer
    • Kafka Connect – these are servers that constantly import and export data from existing systems in your infrastructure such as relational databases
    • Kafka clusters are highly scalable and fault-tolerant

    Clients

    • Allow you to write distributed applications that read, write, and process streams of events in parallel, in a way that is fault-tolerant and scales
      • These clients are available in many programming languages – both the ones provided by the core platform as well as 3rd party clients

    Concepts

    Events

    • It’s a record of something that happened – also called a “record” in the documentation
      • Has a key
      • Has a value
      • Has an event timestamp
      • Can have additional metadata

    Producers and Consumers

    • Producers – these are the client applications that publish/write events to Kafka
    • Consumers – these are the client applications that read/subscribe to events from Kafka
    • Producers and consumers are completely decoupled from each other

    Topics

    • Events are stored in topics
    • Topics are like folders on a file system – events would be the equivalent of files within that folder
    • Topics are multi-producer and multi-subscriber
      • There can be zero, one or many producers or subscribers to a topic that write to or read from that topic respectively
    • Unlike many message queuing systems, these events can be read from as many times as necessary because they are not deleted after being consumed
      • Deleting of messages is handled on a per topic configuration that determines how long events are retained
      • Kafka’s performance is not dependent on the amount of data nor the duration of time data is stored, so storing for longer periods is not a problem
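
    To make the "read as many times as necessary" point concrete, here is a toy, single-partition topic in Python. It is a sketch under assumptions, not how Kafka stores data: the `Event` and `Topic` classes and the consumer names are hypothetical, and retention is not modeled. Each consumer only tracks its own offset into a shared append-only log.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)
    headers: dict = field(default_factory=dict)  # optional metadata


class Topic:
    """Toy single-partition topic: an append-only list plus one offset per consumer."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer name -> index of the next event to read

    def publish(self, event: Event) -> None:
        self.log.append(event)

    def poll(self, consumer: str) -> list:
        """Return events this consumer has not seen yet; nothing is removed from the log."""
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.log)
        return self.log[start:]


topic = Topic()
topic.publish(Event("order-1", "created"))
topic.publish(Event("order-1", "paid"))

billing = [e.value for e in topic.poll("billing")]
audit = [e.value for e in topic.poll("audit")]            # same events, read again
billing_again = [e.value for e in topic.poll("billing")]  # nothing new yet
print(billing, audit, billing_again)
```

    Consuming an event only advances that consumer's offset, which is why producers and consumers can stay completely decoupled.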

    Resources we Like

    • Why Strimzi moved away from statefulsets (github.com)

    Tip of the Week

    • Flipper Zero is a multi-functional interaction device mixed with a Tamagotchi. It has a variety of IO options built in, RFID, NFC, GPIO, Bluetooth, USB, and a variety of low-voltage pins like you’d see on an Arduino. Using the device upgrades the dolphin, encouraging you to try new things…and it’s all open-source with a vibrant community behind it. (shop.flipperzero.one)
    • Kafka Tui?! Kaskade is a cool-looking Kafka TUI that has got to be better than using the scripts in the build folder that comes with Kafka. (github.com/sauljabin/kaskade)
    • Microstudio is a web-based integrated development environment for making simple games and it’s open source! (microstudio.dev)
    • Bing Copilot has a number of useful prompts (bing.com)
      • Designer (photos)
      • Vacation Planner
      • Cooking assistant
      • Fitness trainer
    • Sharing metrics between projects in GCP, Azure, and maybe AWS???
    • Checking wifi in your home – Android Only (play.google.com)
    • Powering POE without running cables (Amazon)
    • Omada specific – cloud vs local hardware (Amazon)
    • How to “shutdown” a Kafka cluster in Kubernetes:
      • kubectl annotate kafka my-kafka-cluster strimzi.io/pause-reconciliation="true" --context=my-context --namespace=my-namespace
      • kubectl delete strimzipodsets my-kafka-cluster --context=my-context --namespace=my-namespace
      • Then to “restart” the cluster: kubectl annotate kafka my-kafka-cluster strimzi.io/pause-reconciliation- --context=my-context --namespace=my-namespace
    https://github.com/strimzi/proposals/blob/main/031-statefulset-removal.md
    26 May 2024, 11:55 pm
  • StackOverflow AI Disagreements, Kotlin Coroutines and More

    Joe Zack was on a brief holiday so Allen and Michael took over the helm for an episode. What would a new episode be without a little something regarding AI, some more love for Kotlin, and a number of excellent tips throughout (as well as at the end of) the episode?

    Reviews

    • iTunes: ivan.kuchin

    News

    Atlanta Dev Con
    September 7th, 2024
    https://www.atldevcon.com/

    Topics

    Please leave us a review!
    https://www.codingblocks.net/review

    Random Bits

    Tip of the Week

    Docker Blog is pretty excellent

    Car Research

    Utilizing wood sheet goods by utilizing cut lists

    Docker’s chicken-n-egg problem

    Download the file using the server-suggested name with wget …
    --content-disposition
    https://man7.org/linux/man-pages/man1/wget.1.html

    With curl …
    -JO
    -J, --remote-header-name
    -O, --remote-name
    https://curl.se/docs/manpage.html#-J
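
    For context, the "server suggested name" comes from the response's Content-Disposition header. The Python sketch below shows the general idea of preferring that name and falling back to the URL path. It is an illustration only, not the logic wget or curl actually run; `download_name` and the example URLs are hypothetical.

```python
import posixpath
from email.message import Message
from typing import Optional
from urllib.parse import urlsplit


def download_name(url: str, content_disposition: Optional[str]) -> str:
    """Prefer the server-suggested filename; otherwise use the last URL path segment."""
    if content_disposition:
        header = Message()
        header["Content-Disposition"] = content_disposition
        suggested = header.get_filename()
        if suggested:
            # Keep only the base name so a header cannot point outside the target directory.
            return posixpath.basename(suggested)
    return posixpath.basename(urlsplit(url).path)


with_header = download_name("https://example.com/download?id=42",
                            'attachment; filename="report.pdf"')
without_header = download_name("https://example.com/files/archive.zip", None)
print(with_header, without_header)
```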

    13 May 2024, 2:15 am
  • Llama 3 is Here, Spending Time on Environmental Setup and More
    Coding Blocks Episode 233

    In this episode Joe introduces us to more security items you should be aware of in the world of CWE’s, Michael bends to the will of Joe and Allen in his favorite portion of the show, and Allen pontificates on the time spent setting up IDE’s and environments.

    Reviews – Thank You!

    • iTunes: Vlad Bezden, Mom in VA, Make1977
    • Spotify: chutney3000, Xuraith

    Upcoming Events

    Topics

    Open Telemetry

    CNCF – Cloud Native Computing Foundation

    Llama 3 – the next version of Meta’s AI engine

    • “Now available with both 8B and 70B pretrained and instruction-tuned versions to support a wide range of applications”
      https://llama.meta.com/llama3/

    Environmental concerns over the processing required for AI

    Setting up IDE’s and environments

    • IDE vs old school debugging
    • Setup can require a significant amount of time
      • Is it worth it?
      • What if you’re just working on a bug?

    Security Resources

    Tips

    Pre-warning – probably wouldn’t recommend installing this!

    Saw a cool Windows utility called “Windrecorder” that records video and text from your desktop, and lets you rewind and search.

    MacOS’s Spotlight is more powerful than you maybe knew
    https://www.intego.com/mac-security-blog/spotlight-secrets-15-ways-to-use-spotlight-on-your-mac/ 
    https://beebom.com/spotlight-tips-tricks/

    If your grep command isn’t working like you thought it should, you might be a victim of output buffering, where matches sit in a buffer instead of being written out right away
    grep --line-buffered

    iOS – get text from images
    https://support.apple.com/guide/iphone/use-live-text-iphcf0b71b0e/ios

    28 April 2024, 11:55 pm
  • Ktor, Logging Ideas, and Plugin Safety

    Picture, if you will, a nondescript office space, where time seems to stand still as programmers gather around a water cooler. Here, in the twilight of the workday, they exchange eerie tales of programming glitches, security breaches, and asynchronous calls. Welcome to the Programming Zone, where reality blurs and (silent) keystrokes echo in the depths of the unknown. Also, Allen is ready to boom, Outlaw is not happy about these category choices, and Joe takes the easy (but not longest) road.

    The full show notes are available on the website at https://www.codingblocks.net/episode232

    News

    • Thanks for the reviews! Want to help us out? Leave a review! (/reviews)
      • ivan.kuchin, Nick Brooker, Szymon, JT, Scott Harden
    • Text replacements are tricky, replacing links to “twitter.com” with “x.com” enabled a wave of domain spoofing attacks. (arstechnica.com)

    Around the Water Cooler

    • Ktor is an asynchronous web framework based on Kotlin, but can it compete with Spring? (ktor.io)
    • docker init is a great tool for getting started, but how much can you expect from a scaffolding tool? (docs.docker.com)
    • Logging, how much is too much? What if we could go back in time?
    • Boomer Hour: Let’s talk about GChat UX
    • What do you know about browser extensions?
    • Can you trust any extensions?
    • Bookmarklets still rock! (freecodecamp.org)
    • Silent Key Tester for mechanical keyboards, you can specify a wide variety of switches (thockking.com)
      • Joe’s preferences:
        • Durock Shrimp Silent T1
        • Tactile Gazzew Boba U4 Silent
        • Linear Kailh Silent Brown
        • Linear Lichicx Lucy Silent
        • Linear WS Wuque Studio Gray Silent
        • Tactile WS Wuque Studio
        • White Silent – Linear
        • Tactile Kailh Silent Pink
        • Linear Cherry MX Silent Red

    Tip of the Week

    • Feeling nostalgic for the original GameBoy or GameBoy Color? GBStudio is a one-stop shop for making games, it’s open-source and fully featured. You can do the art, music, and programming all in one tool and it’s thoughtfully laid out and well-documented. Bonus…your games will work in GameBoy emulators AND you can even produce your own working physical copies. (If you don’t want the high-level tools you can go old skool with “GBDK” too) (gbstudio.dev)
    • If you’re going to do something, why not script it? If you’re going to script it, save it for next time!
    • Dave’s Garage is a YouTube channel that does deep dives into Windows internals, cool electronics projects, and everything in between! (YouTube)
    14 April 2024, 11:55 pm
  • Importance of Data Structures, Bad Documentation and Comments and More
    Episode 231 Artwork - Moogey's Dog in Slack

    In this episode, Allen, Joe and Michael finally make it back to record together! Allen revisits the basics, Michael kicks off boomer hour nicely, and JZ lets us know that the dream of an 8-bit looking keyboard is not dead.

    News

    Topics

    Tips

    • Remember Carl Schweitzer from MS Dev Show? He’s got a new podcast, The “Cloud Chat”, talking about cloud everything…like episode 1 about the aas’ of cloud computing!
      https://podcasts.apple.com/us/podcast/cloudchat/id1734938265
    • Joe has another music suggestion for you, this time it’s a new album by Four Tet. If you’re not familiar with Four Tet, it’s often described as “IDM” or intelligent dance music. It’s slower and more experimental than what you’d hear in a club though it still has those steady beats to help you get in the zone.
      https://open.spotify.com/album/7mpTSR6E855VhdCeoPgpCF
      https://music.apple.com/us/album/three/1729585296
    • Sometimes Google’s GCP API’s don’t seem to tell the truth
    • See what your helm-templates will render using this online tool
      https://helm-playground.com
    • Some useful Java JVM settings when working with containers
      • -XX:+UseContainerSupport this one makes the JVM aware of the container’s resource limits – this way the JVM benefits from the CPU / Memory allocated to the container
      • -XX:InitialRAMPercentage=80.0 this one tells the JVM to use 80% of the RAM for the initial heap size – this is based off the container memory LIMIT
      • -XX:MaxRAMPercentage=80.0 this one tells the JVM to use 80% of the RAM for the MAX heap size – this is based off the container memory LIMIT
      • -XX:MaxDirectMemorySize based on reading, if NOT SET, this should default to the same as the Max Heap Size – which is better than what we were doing previously – previously we had this set to 256m which is smaller than some of the larger files we get from the CDS and was causing OOM issues.
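
    The percentage settings are easier to reason about with numbers. This small Python sketch only does the arithmetic; the 2048 MiB container limit and the `heap_mib` helper are assumptions for illustration, and the real JVM may round or adjust the result.

```python
def heap_mib(container_limit_mib: float, ram_percentage: float) -> float:
    """Heap size implied by a RAM percentage applied to the container memory limit."""
    return container_limit_mib * ram_percentage / 100.0


limit = 2048  # hypothetical container memory limit in MiB
max_heap = heap_mib(limit, 80.0)
headroom = limit - max_heap  # what is left for everything that is not heap
print(f"max heap: {max_heap:.1f} MiB, headroom: {headroom:.1f} MiB")
```

    The headroom matters because non-heap memory (such as direct buffers and thread stacks) also counts against the container limit, so a high percentage leaves little room for it.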
    1 April 2024, 2:03 am
  • Decorating your Home Office

    This time we are missing the “ocks”, but we hope you enjoy this off…ice topic chat about personalizing our workspaces. Also, Joe had to put a quarter in the jar, and Outlaw needs a cookie.

    The full show notes are available on the website at https://www.codingblocks.net/episode230

    News

    Thank you for the review Szymon! Want to leave us a review?

    Decorating your Home Office

    • Joe’s Uplift Desk Review
    • Mounting monitors, is there any other way?
    • To grommet or not to grommet?
    • How many keys do you want on your keyboard?
    • Wired vs Wireless
    • About that “fn” key…
    • Reddit for inspiration?
    • Office-Appropriate Art
      • Paintings
      • Prints / Silk Screens / Photography
      • Sculptures
      • Book Cases
      • There’s a story for Outlaw about this print: https://www.johndyerbaizley.com/product/four-horsemen-full-color-ap

    Tip of the Week

    • If you have a car, you should consider getting a Mirror Dash Cam. It’s a front and rear camera system that replaces your rearview mirror with a touchscreen. Impress all your friends with your recording, zoom, night vision, parking assistance, GPS, and 24/7 recording and monitoring. (Amazon)
    • Be careful about exercising after you give blood, else you might end up needing it back! (redcrossblood.org)
    The Cloud Nine Ergonomics Keyboard looks pretty nice…

    John Dyer Baizley does some really cool stuff, including artwork for some of our favorite bands
    18 March 2024, 12:54 am
  • Multi-Value, Spatial, and Event Store Databases

    We are mixing it up on you again, no Outlaw this week, but we can offer you some talk of exotic databases. Also, Joe pronounces everything correctly and Allen leaves you with a riddle.

    The full show notes are available on the website at https://www.codingblocks.net/episode229

    News

    • Thanks for the reviews!
      • ivan.kuchin (has taken the lead!), Yoondoggy, cykoduck, nehoraigold
      • Want to help us out? Leave a review! (reviews)

    Multivalue DBMS

    • Popular: 86. Adabas, 87. UniData/UniVerse, 147. JBase
    • Similar to RDBMS – store data in tables
      • Store multiple values to a particular record’s attribute
        • Some RDBMS’s can do this as well, BUT it’s typically an exception to the rule when you’d store an array on an attribute
        • In a MultiValue DBMS – that’s how you SHOULD do it
        • Part of the reason it’s done this way is these database systems are not optimized for JOINS
      • Looked at the Adabas and UniData sites – the primary selling points seem to be rapid application development / ease of learning and getting up to speed as well as data modeling that closely mirrors your application data structures
    • I BELIEVE it’s a schema on write (docs.rocketsoftware.com)
    • Supposed to be very performant as you access the data the way your application expects it
    • Per the docs, it’s easy to maintain (Wikipedia)

    Spatial DBMS

    • Popular: 29. PostGIS, 59. Aerospike, 136. SpatiaLite
    • Provides the ability to efficiently store, modify, and query spatial data – data that appears in a geometrical space (maps, polygons, etc)
    • Generally have custom data types for storing the spatial data
    • Indices that allow for quick retrieval of spatial data about other spatial data
    • Also allow for performing spatial-specific operations on data, such as computing distances, merging or intersecting objects or even calculating areas
    • Geospatial data is a subset of spatial data – they represent places / spatial data on the Earth’s surface
    • Spatio-temporal data is another variation – spatial data combined with timestamps
    • PostGIS – basically a plugin for PostgreSQL that allows for storing of spatial data
      • Additionally supports raster data – data for things like weather and elevation
      • If you want to learn how to use it and understand the data and what’s stored (postgis.net)
        • Spatial data types are: point, line, polygon, and more…basically shapes
        • Rather than using b-tree indexes for sorting data for fast retrieval, spatial indexes use bounding boxes – rectangles that identify what is contained within them
          • Typically accomplished with R-Tree and Quadtree implementations
          • RedFin – a real estate competitor to realtor.com and others, uses PostgreSQL / PostGIS
          • Quite a bit of software that supports OpenGIS so may be a good place to start if you’re interested in storing/querying spatial data
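
    The bounding-box idea behind spatial indexes can be sketched in plain Python. This is a simplified illustration under assumptions: the `BBox` class, `bbox_of` helper, and sample shapes are hypothetical, and the lookup is a linear scan, whereas an R-Tree or Quadtree nests boxes so most of them never need to be checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "BBox") -> bool:
        """Two rectangles overlap when they overlap on both the x and y axes."""
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


def bbox_of(points) -> BBox:
    """Smallest axis-aligned rectangle containing every (x, y) vertex."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return BBox(min(xs), min(ys), max(xs), max(ys))


# Hypothetical shapes, each a list of (x, y) vertices.
shapes = {
    "park": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "lake": [(10, 10), (12, 10), (11, 13)],
    "road": [(3, 2), (9, 2)],
}
index = {name: bbox_of(points) for name, points in shapes.items()}

query = BBox(2, 1, 5, 4)
candidates = sorted(name for name, box in index.items() if box.intersects(query))
print(candidates)
```

    The box test is cheap and only produces candidates; an exact geometry check would still be needed on the shapes that pass it.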

    Event Stores

    • Popular: 178. EventStoreDB, 336. IBM DB2 Event Store, 338. NEventStore
    • Used for implementing the concept of Event Sourcing
      • Event Sourcing – an application/data store where the current state of an object is obtained by “replaying” all the events that got it to its current state
        • This contrasts with RDBMS’s in that relational typically store the current state of an object – historical state CAN be stored, but that’s an implementation detail that has to be implemented, such as temporal tables in SQL Server or “history tables”
      • Only support adding new events and querying the order of events
        • Not allowed to update or delete an event
        • For performance reasons, many Event Store databases support snapshots for holding materialized states at points in time
    • EventStoreDB – https://www.eventstore.com/eventstoredb
      • Defined as an “immutable log”
      • Features: guaranteed writes, concurrency model, granulated stream and stream APIs
      • Many client interfaces: .NET, Java, Go, Node, Rust, and Python
      • Runs on just about all OSes – Windows, Mac, Linux
      • Highly available – can run in a cluster
      • Optimistic concurrency checks that will return an error if a check fails
      • “Projections” allow you to generate new events based off “interesting” occurrences in your existing data
        • For example, you are looking for how many Twitter users said “happy” within 5 minutes of saying “foo coffee shop” and within 2 minutes of saying “London”.
      • Highly performant – 15k writes and 50k reads per second
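
    The event sourcing ideas above (append-only events, replay, snapshots, and optimistic concurrency checks) fit in a short Python sketch. It is a toy model under assumptions, not the EventStoreDB API: `EventStream`, `ConcurrencyError`, `apply_event`, `replay`, and the bank-balance events are all hypothetical names chosen for illustration.

```python
class ConcurrencyError(Exception):
    """Raised when a writer's expected version does not match the stream."""


class EventStream:
    """Toy append-only stream: events can be added and read in order, never changed."""

    def __init__(self):
        self._events = []

    @property
    def version(self) -> int:
        return len(self._events)

    def append(self, event: dict, expected_version: int) -> None:
        # Optimistic concurrency: reject the write if someone else appended first.
        if expected_version != self.version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {self.version}")
        self._events.append(event)

    def read(self, from_version: int = 0) -> list:
        return list(self._events[from_version:])


def apply_event(balance: int, event: dict) -> int:
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance


def replay(stream: EventStream, snapshot=(0, 0)) -> int:
    """Rebuild state from a (version, balance) snapshot plus the events recorded after it."""
    version, balance = snapshot
    for event in stream.read(version):
        balance = apply_event(balance, event)
    return balance


stream = EventStream()
stream.append({"type": "deposited", "amount": 100}, expected_version=0)
stream.append({"type": "withdrawn", "amount": 30}, expected_version=1)
snapshot = (stream.version, replay(stream))  # materialized state at this point in time
stream.append({"type": "deposited", "amount": 5}, expected_version=2)

balance = replay(stream, snapshot)  # only the event after the snapshot is replayed
print(balance)
```

    Replaying from the snapshot and replaying from the very first event give the same balance; the snapshot only saves work.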

    Resources we like

    Tip of the Week

    • If your internet connection is good, but your cell phone service is bad then you might want to consider Ooma. Ooma sells devices that plug into your network or connect wirelessly and provide a phone number, and a phone jack so you can hook up an old-school home telephone. We’ve been using it for about a week now with no problems and it’s been a breeze to set up. The devices range from $99 to $129 and there’s a monthly “premier” plan you can buy with nifty features like a secondary phone line, advanced call blocking, and call forwarding. (ooma.com)
    • Why use “git reset --hard” when you can “git stash -u” instead? Reset is destructive, but stashing keeps your changes just in case you need them. Because sometimes, your “sometimes” is now!
      • 🚫 “git reset --hard”
      • ✅ “git stash -u”


    4 March 2024, 1:51 am
  • Overview of Object Oriented, Wide Column, and Vector Databases

    We have a different combination of the hosts for this episode where we continue the series on the types of database systems available and why you might choose one over another. Michael continues impressing by recalling everything we’ve ever said on our 500+ hours of podcasts, Allen enjoys learning about a database system he’d never come across, and Joe is loaded up and ready for his trek to Georgia, USA.

    Reviews

    • iTunes: Calum55555
    • Spotify: Ian Neethling, Ghostmerc, Xuraith
    • Audible: Wood2prog

    News

    Orlando Code Camp
    https://orlandocodecamp.com/

    Object Oriented DBMS

    Wide Column Stores

    • Popular: 12. Cassandra, 26. HBase, 27. Azure Cosmos DB
    • Also known as extensible record stores
      https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
    • Can hold extremely large numbers of dynamic columns
      • How much is a large number – “a record can have billions of columns” – which is why they’re also described as two-dimensional key/value stores
    • Schema on read
    • Wide column stores should not be confused with columnar storage in RDBMS – the latter is an implementation detail inside a relational database system that improves OLAP type of performance by storing data column by column rather than record by record
    • Using Cassandra as the information – https://cassandra.apache.org/_/cassandra-basics.html
      • Hyper-horizontally scalable
        • Prevents data loss due to hardware failures (if scaled)
      • Ability to tweak throughput of reads or writes in isolation
        https://www.codingblocks.net/podcast/search-driven-apps/
      • Its “distributed” manner means it runs on many nodes but it looks like a single point of entry
      • No real point of running a single node of Cassandra
      • “Masterless” architecture – every node in a cluster acts like every other node
        https://www.codingblocks.net/podcast/designing-data-intensive-applications-secondary-indexes-rebalancing-routing/
      • In contrast with traditional RDBMS – can be scaled on low-cost, commodity hardware – don’t need super-high-end motherboards that support terabytes of RAM to scale
      • Linear scalability – every node you add gives you + n throughput
        https://www.datastax.com/products/datastax-astra
      • Replication is handled by tweaking replication factors – ie how many times you want the data replicated in order to stay in a good state
      • Per query configurable consistency – how many nodes must acknowledge the read/write query before returning a success
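
    The interplay between replication factor and per-query consistency comes down to simple arithmetic. The Python sketch below is illustrative only: `quorum` and `overlaps` are hypothetical helpers, not part of any Cassandra driver. The rule it encodes is that a read is guaranteed to see the latest acknowledged write when the read and write node counts together exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def overlaps(read_nodes: int, write_nodes: int, replication_factor: int) -> bool:
    """True when every read must touch at least one replica that acknowledged the write."""
    return read_nodes + write_nodes > replication_factor


rf = 3
q = quorum(rf)
for reads, writes in [(1, 1), (q, q), (1, rf)]:
    print(f"read={reads} write={writes} overlap={overlaps(reads, writes, rf)}")
```

    Lower counts return faster but may read stale data, which is the trade-off the per-query setting lets you make.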

    Vector DBMS

    • Popular: 52. Kdb, 103. Pinecone, 139. Chroma
    • A database system that specializes in storing vector embeddings and being able to retrieve them quickly
      • What is a vector embedding?
        • https://www.pinecone.io/learn/vector-embeddings-for-developers/
        • What is a vector? A mathematical structure with a size and a direction
          • Think of it as a point in space (on a graph) with the direction being the arrow from (0,0,0) to the vector point
          • They say for developers, it’s easier to think of vectors as an array of numbers
          • When you look at the vectors in space, some will be floating by themselves while others might be clustered closely to each other
        • Vectors are very useful in Machine Learning algorithms because CPUs and GPUs are very good at doing math
        • Vector embedding is the process of converting virtually any data structure into vectors
        • It’s not as simple as just a straight conversion
          • You don’t want to lose the original data’s “meaning”
            • An example they used was comparing two sentences – you wouldn’t just compare the words, you want to compare if the two sentences had the same meaning
            • To keep the meaning and produce vectors with relationships that make sense, that requires embedding models
          • Nowadays, many embedding models are created by passing large sets of “labeled” data to neural networks
            https://en.wikipedia.org/wiki/Neural_network
            • Neural networks are usually trained using supervised learning, but they can also use self-supervised or unsupervised learning
              • Using a supervised model, you pass in large sets of data as pairs of inputs and labeled outputs
              • The values are transformed in each layer of the neural network
              • With each training of the neural network, the activations at each layer are modified
              • The goal is that eventually the neural network will be able to provide an output for any given input, even if it hasn’t seen that specific input before
            • The embedding model is essentially those layers of the neural network minus the last one that was labeling data – rather than getting labeled data you get a vector embedding
          • They have a great visualization on the Pinecone page showing the output of a word2vec embedding model and how words would appear in this 3D vector space
          • This is what an embedding model does – it can take inputs and know where to place them in “vector space”
            • Items placed closer together are more related, and further apart, less related
    • Ok, so now we know what vector embeddings are, what can we do with them?
      • Semantic search – rather than having search engines be able to search for words that are similar to what you entered, they can now search for content with meaning similar to what you searched for
      • Question answering applications
      • Audio search
    • Check out the page of sample applications – https://docs.pinecone.io/page/examples
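
    The idea that related items sit closer together in vector space can be sketched in a few lines. The three-dimensional "embeddings" below are made up for illustration (a real embedding model would produce them, usually with hundreds of dimensions), and `nearest` is a hypothetical brute-force helper, not how a vector database indexes data internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-written vectors standing in for model output.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}


def nearest(query_vector, k=2):
    """Return the k stored items most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), word) for word, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# A query vector that points in roughly the same direction as the feline words.
print(nearest([0.88, 0.85, 0.12]))
```

    A semantic search works the same way at a larger scale: embed the query, then return the stored vectors closest to it.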

    Resources

    Tips of the Week

    19 February 2024, 12:55 am
  • Picking the Right Database Type – Tougher than You Think

    You asked, we listened! A request from one of our Slack channels was to go over the various types of databases and why you might choose one over another. Join us in another information-filled episode where Joe won’t be attending the event he’s been promoting and Allen tries to keep his voice together for the entirety of the episode, and almost succeeds.

    News

    Reviews

    • iTunes: ivan.kuchin, MikeW717
    • Spotify: Darren Pruitt, chutney3000

    Upcoming Events

    Miscellaneous

    • Kudos to Dell Support on their monitors
    • The Cat 8 journey will be beginning soon
    • Home offices – random desires

    Database Types

    Primary resource we used

    Some terminology we’ll be using

    • Schema on write – the schema for the data is determined before writing the record
    • Schema on read – the schema of the data is understood by the client using the data
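
    As a rough illustration of the two terms, the sketch below validates records before storing them (schema on write) and, separately, stores raw payloads and leaves interpretation to the reader (schema on read). `SCHEMA`, the function names, and the sample records are all hypothetical.

```python
import json

SCHEMA = {"id": int, "name": str}  # hypothetical schema for illustration


def write_with_schema(table, record):
    """Schema on write: reject the record unless it matches SCHEMA."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def write_raw(store, payload):
    """Schema on read: store whatever arrives, unchecked."""
    store.append(payload)


def read_name(payload):
    """The reader decides how to interpret the stored payload."""
    return json.loads(payload).get("name", "<unknown>")


table, store = [], []
write_with_schema(table, {"id": 1, "name": "Allen"})
try:
    write_with_schema(table, {"id": "2", "name": "Joe"})  # wrong type for id
except TypeError as exc:
    print("rejected at write time:", exc)

write_raw(store, '{"id": 1, "name": "Allen"}')
write_raw(store, '{"nickname": "Outlaw"}')  # accepted without any checks
print([read_name(p) for p in store])  # ['Allen', '<unknown>']
```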

    Relational DBMS

    • Popular – 1. Oracle, 2. MySQL, 3. Microsoft SQL Server, 4. PostgreSQL, 8. IBM DB2, 9. Snowflake, 11. Microsoft Access
    • Schema on write
    • Primary language / form of access is SQL
    • Schema is defined by named tables with named columns and specific data types
    • Data exists as rows in the table that conform to the columns/types that are defined in the schema
    • Scalability – typically vertical scaling (increasing available CPU/RAM) is the preferred way
    • Can be very performant but requires knowledge on how to index and store data properly
      • Even with excellent design and indexing, performance can suffer as size of data grows
    • Some fun Instagram posts on scaling their databases
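
    The relational points above (named tables, typed columns, SQL access, indexing) can be tried with Python's built-in sqlite3 module. This is only a sketch: the `episodes` table, its columns, and the rows are invented for the example, and SQLite is used simply because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front: a named table with named, typed columns.
conn.execute(
    "CREATE TABLE episodes ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, published TEXT NOT NULL)"
)
# An index on the column we filter by, so lookups need not scan every row.
conn.execute("CREATE INDEX idx_episodes_published ON episodes (published)")

conn.executemany(
    "INSERT INTO episodes (id, title, published) VALUES (?, ?, ?)",
    [(1, "Episode A", "2024-01-21"), (2, "Episode B", "2024-02-05")],
)

rows = conn.execute(
    "SELECT title FROM episodes WHERE published >= ? ORDER BY published",
    ("2024-02-01",),
).fetchall()
print(rows)  # [('Episode B',)]

# A row that violates the schema is rejected when it is written.
try:
    conn.execute(
        "INSERT INTO episodes (id, title, published) VALUES (?, ?, ?)",
        (3, None, "2024-03-01"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```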

    Key-value stores

    • Popular: 6. Redis, 15. Amazon DynamoDB, 27. Azure Cosmos DB, 35. Memcached, 54. etcd
    • Schema on read
    • No real language – usually an API to put and get values
    • Depending on the key-value store, you may be able to store complex data structures and query them in various ways
    • Scalability – horizontally scalable – massively
    • Very performant
    • Many have built in extended functionality beyond looking up by a single key – for instance, Redis allows search engine type of filtering
    • Why’s Hadoop not on the list? 
      https://db-engines.com/en/blog_post/16
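
    One reason key-value stores scale out so well is that a key alone is enough to decide which node owns the data. Below is a minimal in-memory sketch of a put/get API with hash-based sharding. `ShardedKeyValueStore` is a hypothetical class, and the simple modulo placement is an assumption for brevity; production systems tend to use schemes such as consistent hashing so that adding a node moves fewer keys.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across in-memory 'nodes'."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key so placement is stable and roughly uniform.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


kv = ShardedKeyValueStore(node_count=4)
kv.put("user:1", {"name": "Allen"})
kv.put("user:2", {"name": "Joe"})
print(kv.get("user:1"))            # {'name': 'Allen'}
print(kv.get("user:404", "miss"))  # miss
```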

    Document Stores

    • Popular: 5. MongoDB, 15. Amazon DynamoDB, 17. Databricks, 27. Azure Cosmos DB, 34. Couchbase
    • Schema on read
    • DBMS-specific querying – most offer a SQL-like capability, but it is often not the most powerful way to query the data
    • Documents do not need to conform to any schema
      • Multiple documents in the same collection can have completely different fields/properties, OR they can have the same properties with different data types
      • Documents can contain collections in fields or even nest other documents
      • Typically stores data in JSON like documents
    • Can be very performant but may require care to create proper indexes, manage connections, etc
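
    To show what "documents do not need to conform to any schema" looks like in practice, here is a small sketch that uses a plain Python list as a stand-in for a collection. The documents, `get_path`, and `find` are hypothetical, and real document stores provide their own query languages and indexes rather than a linear scan.

```python
# Three documents in one "collection": different fields, nested documents,
# and the same field ("name") holding different data types.
collection = [
    {"_id": 1, "name": "Ada", "tags": ["admin", "dev"]},
    {"_id": 2, "name": "Grace", "address": {"city": "Atlanta", "zip": "30301"}},
    {"_id": 3, "name": 42},
]


def get_path(doc, path):
    """Follow a dotted path such as 'address.city'; None if any part is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, path, value):
    """Return every document whose value at the dotted path equals value."""
    return [doc for doc in docs if get_path(doc, path) == value]


print([d["_id"] for d in find(collection, "address.city", "Atlanta")])  # [2]
print([d["_id"] for d in find(collection, "name", 42)])                 # [3]
```

    Because nothing is enforced at write time, the reading code has to cope with missing fields and unexpected types, which is the schema-on-read trade-off.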

    Time Series DBMS

    Graph DBMS

    Search engine

    • Popular: 7. Elasticsearch, 14. Splunk, 24. Solr, 40. OpenSearch, 58. MarkLogic
    • Extensions of NoSQL databases
    • Schema on read
    • Complex search expressions
    • Full text search
    • Stemming – reducing words to their root forms so that searches can be more accurate with similar word searches
    • Ranking and grouping of search results
    • Built for scalability
    • Incredibly performant for the use case
    • Not great with relationship data
    • Why choose over something like a relational or document database?
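
    Full text search, stemming, and ranking can be sketched with a tiny inverted index. Everything here is a toy under stated assumptions: the `stem` function just strips a few suffixes (real engines use proper algorithms such as Porter stemming), ranking is a plain term-count sum, and the sample documents are made up.

```python
import re
from collections import defaultdict


def stem(word):
    """Very naive stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into words, and reduce each word to its stem."""
    return [stem(word) for word in re.findall(r"[a-z]+", text.lower())]


def build_index(docs):
    """Inverted index: stem -> {doc_id: number of occurrences}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Rank documents by how often the stemmed query terms occur in them."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    1: "Searching indexed documents",
    2: "She searched and searches the index",
    3: "Cooking recipes",
}
index = build_index(docs)
print(search(index, "search index"))  # [2, 1]
```

    Looking up a term in the index is a dictionary access instead of a scan over every document, which hints at why a search engine can beat a relational or document database for this use case.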

    Resources

    Tips of the Episode

    5 February 2024, 12:50 am
  • There is still cool stuff on the internet
    This episode we are talking about keeping the internet interesting and making cool things by looking at PagedOut and Itch.io. Also, Allen won’t ever mark you down, Outlaw won’t ever give you up, and Joe took a note to say something about Barbie here but he can’t remember what it was. The full show notes […]
    21 January 2024, 5:33 pm