The data pipeline has multiple stages, and any of them can and will occasionally break or slow down because of hardware failures or misconfiguration. When that happens, there is simply too much data to buffer it all for very long: eventually some of it gets dropped, causing gaps in analytics and a degraded product experience unless proper mitigations are in place.
How does one retain valuable information from more than half a billion events per second, when some must be dropped? Drop it in a controlled way, by downsampling.
Here is a visual analogy showing the difference between uncontrolled data loss and downsampling. In both cases the same number of pixels were delivered. One is a higher resolution view of just a small portion of a popular painting, while the other shows the full painting, albeit blurry and highly pixelated.
As we noted above, any point in the pipeline can fail, so we want the ability to downsample at any point as needed. Some services proactively downsample data at the source before it even hits Logfwdr. This makes the information extracted from that data a little bit blurry, but much more useful than what otherwise would be delivered: random chunks of the original with gaps in between, or even nothing at all. The amount of "blur" is outside our control (we make our best effort to deliver full data), but there is a robust way to estimate it, as discussed in the next section.
Logfwdr can also decide to downsample data sitting in its buffer when the buffer overflows. Logfwdr handles many data streams at once, so we need to prioritize them: we assign each data stream a weight and apply weighted max-min fairness to make better use of the buffer. This lets each data stream store as much as it needs as long as the whole buffer is not saturated; once it is, the streams divide the buffer fairly according to their weights.
In our implementation (Go), each data stream is driven by a goroutine, and they cooperate via channels. They consult a single tracker object every time they allocate and deallocate memory. The tracker uses a max-heap to always know who the heaviest participant is and what the total usage is. Whenever the total usage goes over the limit, the tracker repeatedly sends the "please shed some load" signal to the heaviest participant, until the usage is again under the limit.
The effect of this is that healthy streams, which buffer a tiny amount, allocate whatever they need without losses. But any lagging streams split the remaining memory allowance fairly.
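To make this concrete, here is a minimal sketch of such a tracker, with hypothetical names and without the goroutine and channel plumbing (or locking) that a real implementation needs. It keeps a max-heap of streams ordered by weighted usage and flags the heaviest one to shed load whenever the shared limit is exceeded:

```go
package main

import (
	"container/heap"
	"fmt"
)

type stream struct {
	name   string
	weight float64 // higher weight entitles the stream to a larger share
	bytes  int64   // currently buffered bytes
	shed   bool    // set when the stream has been asked to shed load
	index  int     // position in the heap, maintained by heap.Interface
}

// Streams compete by weighted size, as in weighted max-min fairness.
func (s *stream) weightedSize() float64 { return float64(s.bytes) / s.weight }

type byWeightedSize []*stream

func (h byWeightedSize) Len() int           { return len(h) }
func (h byWeightedSize) Less(i, j int) bool { return h[i].weightedSize() > h[j].weightedSize() }
func (h byWeightedSize) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *byWeightedSize) Push(x any) {
	s := x.(*stream)
	s.index = len(*h)
	*h = append(*h, s)
}
func (h *byWeightedSize) Pop() any {
	old := *h
	s := old[len(old)-1]
	*h = old[:len(old)-1]
	return s
}

type tracker struct {
	streams byWeightedSize
	total   int64
	limit   int64
}

// update is what every stream calls when it allocates (delta > 0) or frees
// (delta < 0) buffer memory. Healthy streams are never bothered; only the
// heaviest one gets the "please shed some load" signal, repeatedly, until
// the total fits under the limit again.
func (t *tracker) update(s *stream, delta int64) {
	s.bytes += delta
	t.total += delta
	heap.Fix(&t.streams, s.index)
	if t.total > t.limit {
		t.streams[0].shed = true // in the real system: a signal on a channel
	}
}

func main() {
	healthy := &stream{name: "healthy", weight: 1}
	lagging := &stream{name: "lagging", weight: 1}
	t := &tracker{limit: 1024}
	heap.Push(&t.streams, healthy)
	heap.Push(&t.streams, lagging)

	t.update(healthy, 100)  // small buffers are left alone
	t.update(lagging, 2000) // this overflows the shared limit
	fmt.Printf("shed signals: %s=%v, %s=%v\n",
		healthy.name, healthy.shed, lagging.name, lagging.shed)
}
```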
We downsample more or less uniformly, always taking some of the least downsampled batches from the buffer (using a min-heap to find them), downsampling them, and merging the results together.
Merging keeps the batches roughly the same size and their number under control.
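Here is a rough sketch of that downsample-and-merge step, with hypothetical types and ignoring compression. Every event carries its own sample interval, so batches at different downsampling levels can be merged and the math described later still works on the result:

```go
package main

import (
	"fmt"
	"math/rand"
)

type event struct {
	value          float64 // the value of interest, e.g. response bytes
	sampleInterval int64   // how many original events this event represents
}

type batch []event

// downsample keeps each event with probability 1/keep (an independent coin
// toss per event) and multiplies the sample interval of the survivors.
func downsample(b batch, keep int64) batch {
	var out batch
	for _, e := range b {
		if rand.Int63n(keep) == 0 {
			e.sampleInterval *= keep
			out = append(out, e)
		}
	}
	return out
}

// shedLoad downsamples the two least-downsampled batches by a factor of two
// and merges them, keeping both the number of batches and their sizes under
// control. The real buffer finds those batches with a min-heap; here they are
// simply assumed to be the first two.
func shedLoad(batches []batch) []batch {
	merged := append(downsample(batches[0], 2), downsample(batches[1], 2)...)
	return append([]batch{merged}, batches[2:]...)
}

func main() {
	fresh := func(n int) batch {
		b := make(batch, n)
		for i := range b {
			b[i] = event{value: 1, sampleInterval: 1}
		}
		return b
	}
	batches := []batch{fresh(1000), fresh(1000), fresh(1000)}
	batches = shedLoad(batches)
	fmt.Printf("%d batches left; the merged one holds %d events at sample interval 2\n",
		len(batches), len(batches[0]))
}
```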
Downsampling itself is cheap, but because the data in the buffer is compressed, downsampling it forces a recompression, which is the single most expensive thing we do to the data. And using extra CPU time is the last thing you want when the system is under heavy load! We compensate for the recompression cost by also downsampling the fresh data (before it gets compressed for the first time) whenever the stream is in the "shed the load" state.
We called this approach "bottomless buffers", because you can squeeze effectively infinite amounts of data in there, and it will just automatically be thinned out. Bottomless buffers resemble reservoir sampling, where the buffer is the reservoir and the population comes as the input stream. But there are some differences. First is that in our pipeline the input stream of data never ends, while reservoir sampling assumes it ends to finalize the sample. Secondly, the resulting sample also never ends.
Let's look at the next stage in the pipeline: Logreceiver. It sits in front of a distributed queue. The purpose of Logreceiver is to partition each stream of data by a key that makes it easier for Logpush, the Analytics inserters, or some other process to consume.
Logreceiver proactively performs adaptive sampling of analytics. This improves the accuracy of analytics for small customers (receiving on the order of 10 events per day), while more aggressively downsampling large customers (millions of events per second). Logreceiver then pushes the same data at multiple resolutions (100%, 10%, 1%, etc.) into different topics in the distributed queue. This allows it to keep pushing something rather than nothing when the queue is overloaded, by just skipping writing the high-resolution samples of data.
The same goes for Inserters: they can skip reading or writing high-resolution data. The Analytics APIs can skip reading high resolution data. The analytical database might be unable to read high resolution data because of overload or degraded cluster state or because there is just too much to read (very wide time range or very large customer). Adaptively dropping to lower resolutions allows the APIs to return some results in all of those cases.
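For illustration, here is one way the producer side could route a single stream into resolution tiers. The topic names are made up, and the nesting of the tiers (the 1% sample being a subset of the 10% sample) is an assumption made for the sketch: a single random draw decides how deep each event goes, so skipping the high-resolution topics under overload still leaves consistent low-resolution samples.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Each resolution tier maps to its own topic; the sample interval says how
// many original events each event in that tier represents.
var resolutions = []struct {
	topic          string
	sampleInterval int64
}{
	{"events_100pct", 1},
	{"events_10pct", 10},
	{"events_1pct", 100},
}

// route returns the topics one event should be written to. A single draw in
// [0, 100) makes the tiers nested: everything goes to the 100% topic, about
// a tenth of it also goes to the 10% topic, about a hundredth to the 1% topic.
func route(r *rand.Rand) []string {
	draw := r.Int63n(resolutions[len(resolutions)-1].sampleInterval)
	var topics []string
	for _, res := range resolutions {
		if draw%res.sampleInterval == 0 {
			topics = append(topics, res.topic)
		}
	}
	return topics
}

func main() {
	r := rand.New(rand.NewSource(1))
	counts := map[string]int{}
	for i := 0; i < 100000; i++ {
		for _, topic := range route(r) {
			counts[topic]++
		}
	}
	fmt.Println(counts) // roughly 100000, 10000 and 1000 events respectively
}
```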
Okay, we have some downsampled data in the analytical database. It looks like the original data, but with some rows missing. How do we make sense of it? How do we know if the results can be trusted?
Let's look at the math.
Since the amount of sampling can vary over time and between nodes in the distributed system, we need to store this information along with the data. With each event $x_i$ we store its sample interval, which is the reciprocal of its inclusion probability $\pi_i = \frac{1}{\text{sample interval}}$. For example, if we sample 1 in every 1,000 events, each event included in the resulting sample has $\pi_i = 0.001$ and a sample interval of 1,000. When we further downsample that batch of data, the inclusion probabilities (and the sample intervals) multiply together: a 1 in 1,000 sample of a 1 in 1,000 sample is a 1 in 1,000,000 sample of the original population. The sample interval of an event can also be interpreted roughly as the number of original events that this event represents, so in the literature it is known as the weight $w_i = \frac{1}{\pi_i}$.

We rely on the Horvitz-Thompson estimator (HT, paper) to derive analytics about the $x_i$. It gives two estimates: the analytical estimate (e.g. the population total or size) and the estimate of the variance of that estimate. The latter lets us figure out how accurate the results are by building confidence intervals, which define ranges that cover the true value with a given probability (the confidence level). A typical confidence level is 0.95, at which a confidence interval (a, b) tells you that you can be 95% sure the true SUM or COUNT is between a and b.
So far, we know how to use the HT estimator for doing SUM, COUNT, and AVG.
Given a sample of size $n$, consisting of values $x_i$ and their inclusion probabilities $\pi_i$, the HT estimator for the population total (i.e. SUM) is

$$\widehat{T}=\sum_{i=1}^n{\frac{x_i}{\pi_i}}=\sum_{i=1}^n{x_i w_i}.$$

The variance of $\widehat{T}$ is:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij} \pi_i \pi_j}},$$

where $\pi_{ij}$ is the probability of both the $i$-th and $j$-th events being sampled together.

We use Poisson sampling, where each event is subjected to an independent Bernoulli trial ("coin toss") which determines whether the event becomes part of the sample. Since each trial is independent, we can equate $\pi_{ij} = \pi_i \pi_j$, which when plugged into the variance estimator above turns the right-hand sum to zero:

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} + \sum_{i \neq j}^n{x_i x_j \frac{0}{\pi_{ij} \pi_i \pi_j}},$$

thus

$$\widehat{V}(\widehat{T}) = \sum_{i=1}^n{x_i^2 \frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{x_i^2 w_i (w_i-1)}.$$

For COUNT we use the same estimator, but plug in $x_i = 1$. This gives us:

$$\begin{align}
\widehat{C} &= \sum_{i=1}^n{\frac{1}{\pi_i}} = \sum_{i=1}^n{w_i},\\
\widehat{V}(\widehat{C}) &= \sum_{i=1}^n{\frac{1 - \pi_i}{\pi_i^2}} = \sum_{i=1}^n{w_i (w_i-1)}.
\end{align}$$

For AVG we would use

$$\begin{align}
\widehat{\mu} &= \frac{\widehat{T}}{N},\\
\widehat{V}(\widehat{\mu}) &= \frac{\widehat{V}(\widehat{T})}{N^2},
\end{align}$$

if we could, but the original population size $N$ is not known, it is not stored anywhere, and it is not even possible to store because of custom filtering at query time. Plugging in $\widehat{C}$ instead of $N$ only partially works: it gives a valid estimator for the mean itself, but not for its variance, so the resulting confidence intervals are unusable.

In all cases the corresponding pair of estimates is used as the $\mu$ and $\sigma^2$ of a normal distribution (because of the central limit theorem), and then the bounds of the confidence interval at confidence level $\alpha$ are:

$$\Big( \mu - \Phi^{-1}\big(\tfrac{1 + \alpha}{2}\big) \cdot \sigma, \quad \mu + \Phi^{-1}\big(\tfrac{1 + \alpha}{2}\big) \cdot \sigma\Big).$$
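To sanity-check these formulas, here is a small standalone simulation (a sketch, not part of the pipeline): it Poisson-samples a synthetic population at a 1-in-1,000 rate, computes the HT estimate of the SUM and its variance, and builds the 95% confidence interval around it, which should cover the true total about 95% of the time.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func main() {
	r := rand.New(rand.NewSource(42))

	// A synthetic population: one value of interest per event, with a heavy
	// right tail, vaguely like response sizes.
	const n = 1_000_000
	var trueTotal float64
	population := make([]float64, n)
	for i := range population {
		population[i] = math.Floor(r.ExpFloat64() * 1000)
		trueTotal += population[i]
	}

	// Poisson sampling: an independent coin toss per event, with inclusion
	// probability pi = 1/sampleInterval. Each kept event contributes via its weight.
	const sampleInterval = 1000.0
	var est, variance float64
	sampleSize := 0
	for _, x := range population {
		if r.Float64() < 1/sampleInterval {
			w := sampleInterval // w_i = 1/pi_i
			est += x * w
			variance += x * x * w * (w - 1)
			sampleSize++
		}
	}

	// 95% confidence interval via the normal approximation (CLT):
	// 1.959963984540054 = Phi^-1((1+0.95)/2)
	const z = 1.959963984540054
	err := z * math.Sqrt(variance)
	fmt.Printf("sample size: %d\n", sampleSize)
	fmt.Printf("true total:  %.0f\n", trueTotal)
	fmt.Printf("estimate:    %.0f ± %.0f (95%% CI: %.0f .. %.0f)\n",
		est, err, est-err, est+err)
}
```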
We do not know $N$, but there is a workaround: simultaneous confidence intervals. Construct confidence intervals for SUM and COUNT independently, and then combine them into a confidence interval for AVG. This is known as the Bonferroni method. It requires building wider intervals for SUM and COUNT: each gets half of the allowed error probability, so for a 95% result we build 97.5% intervals for both. The actual estimator also has to account for the possibility of either interval's lower bound going below zero.
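Ignoring that corner case for a moment (i.e. assuming both lower bounds stay positive), the combined interval for AVG is just the extreme ratios achievable within the two individual 97.5% intervals:

$$\Bigg( \frac{\widehat{T} - \Phi^{-1}(0.9875)\sqrt{\widehat{V}(\widehat{T})}}{\widehat{C} + \Phi^{-1}(0.9875)\sqrt{\widehat{V}(\widehat{C})}}, \quad \frac{\widehat{T} + \Phi^{-1}(0.9875)\sqrt{\widehat{V}(\widehat{T})}}{\widehat{C} - \Phi^{-1}(0.9875)\sqrt{\widehat{V}(\widehat{C})}} \Bigg),$$

with $\Phi^{-1}(0.9875) = \Phi^{-1}\big(\tfrac{1+0.975}{2}\big) \approx 2.2414$. This is the same construction the SQL below uses for lo_average and hi_average.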
In SQL, the estimators and confidence intervals look like this:
```sql
WITH sum(x * _sample_interval) AS t,
     sum(x * x * _sample_interval * (_sample_interval - 1)) AS vt,
     sum(_sample_interval) AS c,
     sum(_sample_interval * (_sample_interval - 1)) AS vc,
     -- ClickHouse does not expose the erf⁻¹ function, so we precompute some magic numbers,
     -- (only for 95% confidence, will be different otherwise):
     -- 1.959963984540054 = Φ⁻¹((1+0.950)/2) = √2 * erf⁻¹(0.950)
     -- 2.241402727604945 = Φ⁻¹((1+0.975)/2) = √2 * erf⁻¹(0.975)
     1.959963984540054 * sqrt(vt) AS err950_t,
     1.959963984540054 * sqrt(vc) AS err950_c,
     2.241402727604945 * sqrt(vt) AS err975_t,
     2.241402727604945 * sqrt(vc) AS err975_c
SELECT t - err950_t AS lo_total,
       t AS est_total,
       t + err950_t AS hi_total,
       c - err950_c AS lo_count,
       c AS est_count,
       c + err950_c AS hi_count,
       (t - err975_t) / (c + err975_c) AS lo_average,
       t / c AS est_average,
       (t + err975_t) / (c - err975_c) AS hi_average
FROM ...
```
Construct a confidence interval for each timeslot on the timeseries, and you get a confidence band, clearly showing the accuracy of the analytics. The figure below shows an example of such a band in shading around the line.
We started using confidence bands on our internal dashboards, and after a while noticed something scary: a systematic error! For one particular website the "total bytes served" estimate was higher than the true control value obtained from rollups, and the confidence bands were way off. See the figure below, where the true value (blue line) is outside the yellow confidence band at all times.
We checked the stored data for corruption: it was fine. We checked the math in the queries: also fine. It was only after reading through the source code of all the systems responsible for sampling that we found a candidate for the root cause.
We used simple random sampling everywhere, basically "tossing a coin" for each event, but in Logreceiver sampling was done differently. Instead of sampling randomly, it performed systematic sampling: picking events at equal intervals, starting from the first one in the batch.
Why would that be a problem?
There are two reasons. The first is that we can no longer claim $\pi_{ij} = \pi_i \pi_j$, so the simplified variance estimator stops working and the confidence intervals cannot be trusted. But even worse, the estimator for the total becomes biased. To understand why exactly, we wrote a short repro in Python:

```python
import itertools

def take_every(src, period):
    for i, x in enumerate(src):
        if i % period == 0:
            yield x

pattern = [10, 1, 1, 1, 1, 1]
sample_interval = 10  # bad if it has common factors with len(pattern)
true_mean = sum(pattern) / len(pattern)

orig = itertools.cycle(pattern)
sample_size = 10000
sample = itertools.islice(take_every(orig, sample_interval), sample_size)

sample_mean = sum(sample) / sample_size

print(f"{true_mean=} {sample_mean=}")  # prints true_mean=2.5, sample_mean of about 4.0
```
After playing with different values for pattern and sample_interval in the code above, we realized where the bias was coming from.
Imagine a person opening a huge generated HTML page with many small/cached resources, such as icons. The first response will be big, immediately followed by a burst of small responses. If the website is not visited that much, responses will tend to end up all together at the start of a batch in Logfwdr. Logreceiver does not cut batches, only concatenates them. The first response remains first, so it always gets picked and skews the estimate up.
We checked the hypothesis against the raw unsampled data that we happened to have because that particular website was also using one of the Logs products. We took all events in a given time range, and grouped them by cutting at gaps of at least one minute. In each group, we ranked all events by time and looked at the variable of interest (response size in bytes), and put it on a scatter plot against the rank inside the group.
A clear pattern! The first response is much more likely to be larger than average.
We fixed the issue by making Logreceiver shuffle the data before sampling. As we rolled out the fix, the estimation and the true value converged.
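To see why the fix works, here is a sketch mirroring the Python repro above (the batch size and numbers are illustrative, not from production): systematic sampling of the periodic input lands on the big responses too often, while shuffling the batch before picking every k-th event breaks that alignment and restores the true mean.

```go
package main

import (
	"fmt"
	"math/rand"
)

// takeEvery performs systematic sampling: every period-th event,
// starting from the first one in the batch.
func takeEvery(batch []float64, period int) []float64 {
	var out []float64
	for i := 0; i < len(batch); i += period {
		out = append(out, batch[i])
	}
	return out
}

func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// One big response followed by five small ones, repeated:
	// the same pattern as in the Python repro.
	pattern := []float64{10, 1, 1, 1, 1, 1}
	batch := make([]float64, 60000)
	for i := range batch {
		batch[i] = pattern[i%len(pattern)]
	}

	const period = 10
	fmt.Printf("true mean:             %.2f\n", mean(batch))                    // 2.50
	fmt.Printf("systematic sample:     %.2f\n", mean(takeEvery(batch, period))) // 4.00, biased

	// The fix: shuffle before sampling.
	rand.Shuffle(len(batch), func(i, j int) { batch[i], batch[j] = batch[j], batch[i] })
	fmt.Printf("shuffled then sampled: %.2f\n", mean(takeEvery(batch, period))) // about 2.50
}
```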
Now, after battle testing it for a while, we are confident the HT estimator is implemented properly and we are using the correct sampling process.
Using Cloudflare's analytics APIs to query sampled data
We already power most of our analytics datasets with sampled data. For example, the Workers Analytics Engine exposes the sample interval in SQL, allowing our customers to build their own dashboards with confidence bands. In the GraphQL API, all of the data nodes that have "Adaptive" in their name are based on sampled data, and the sample interval is exposed as a field there as well, though it is not possible to build confidence intervals from that alone. We are working on exposing confidence intervals in the GraphQL API, and as an experiment have added them to the count and edgeResponseBytes (sum) fields on the httpRequestsAdaptiveGroups nodes. This is available under confidence(level: X).
For example, we can query the estimates and the 95% confidence intervals for SUM(edgeResponseBytes) and COUNT. The results also include the sample size, which is good to know: we rely on the central limit theorem to build the confidence intervals, so small samples don't work very well.

In one such response, the estimated count is 96947, and we are 95% confident that the true count lies in the range 96874.24 to 97019.76. The estimate and range for the sum of response bytes are reported in the same way. These estimates are based on a sample of 96294 rows, which is plenty of samples to calculate good confidence intervals.
We have discussed what kept our data pipeline scalable and resilient despite doubling in size every 1.5 years, how the math works, and how it is easy to mess up. We are constantly working on better ways to keep the data pipeline, and the products based on it, useful to our customers. If you are interested in doing things like that and want to help us build a better Internet, check out our careers page.