Ask HN: Which recent research paper blew your mind?

2023-07-24 07:34:25

“Overview of SHARD: A System for Highly Available Replicated Data” is the first paper to introduce the concept of database sharding. It was published in 1988 by the Computer Corporation of America.

It is referenced hundreds of times in many classic papers.

But, here’s the thing. It doesn’t exist.

Everyone cites Sarin, DeWitt & Rosenb[e|u]rg’s paper but none have ever seen it. I’ve emailed dozens of academics, libraries, and archives – none of them have a copy.

So it blows my mind that something so influential is, effectively, a myth.

I found LinkedIn profiles for Sunil Sarin, Mark Dewitt, and Ronni Rosenberg who all worked at CCA during this time period.

I’ve gone ahead and sent them each a message asking if they might be able to make the paper available.

If you’d like to get in contact with them yourself and are having trouble finding their LinkedIn, shoot me an email and I’ll be happy to provide you links.

Going through the bibliography of other people’s papers and theses, looking for papers that you’d better cite “for good luck” or because “you gotta cite that one”, is classic PhD student behavior (I’ve done it), so it’s not terribly surprising that something like this can happen. In fact I’d expect it to be much more widespread…

I am infinitely disappointed to discover that you are apparently the only person online who seems to care about this. I found a website about it, but it turns out that’s you too.

Now I’m going to be bugged by this too! Great trivia, and a heck of a mystery.

I guess I’m not too surprised, this seems like a corporate tech report. Some companies were good at having public archives of these (like Bell Labs) but I’m sure it takes a lot of resources to keep that up. It’s essentially some company’s internal Wiki page.

Do Sarin, DeWitt & Rosenb[e|u]rg exist? Are they still alive? Tracking them down and going directly to the source would seem to be the way to go. Perhaps even enlisting some “big names” in the industry to ask around?

That’s pretty interesting, actually. Someone must have a copy somewhere. It seems like a real failure of scholarship if it’s truly lost, and a serious argument against walled-garden-style publishing.

I don’t know why parent comment is stirring up drama but:

1. Not available online doesn’t mean the paper’s existence is made up. It’s a very bold claim to make about the authors that the work they cite is fabricated.

From the available information, this looks like a technical report by a probably now-defunct company back in the 80s. If this was its only form of publication, and not in some conference proceedings for example, it would only be found in select university libraries as a physical copy. But most importantly,

2. This isn’t even as impactful a paper as the parent comment states. Or if its proposed concept is, the original idea probably derives from some other paper that is the one that is actually highly cited and most definitely available online.

The cumulative citation count from Google Scholar and IEEE Xplore doesn’t exceed fifteen for this particular paper, though.

https://scholar.google.com/scholar?cites=1491448744595502026…

This feels like something that would, at scale, be unhealthy for science as a whole. While existing papers have already gone through their own quality checks, this enables bad, misleading, or false statements to propagate, which can end up being a blow to the credibility of the entire model. Shouldn’t there be an ethical duty to do one’s due diligence?

> where authors see the citation in a previous paper and simply copy it into their own.

Without opening the paper to even read the abstract? To me that doesn’t sound “innocent” at all; it’s borderline malpractice…

There’s a crowd who tilt the other way: if I might possibly have hinted at the idea before you, then it’s borderline malpractice not to reference me. In many fields it’s common to directly reference what appear to be the bigger transitive references, even if they didn’t directly influence this work in particular. I’d personally want to see a smidge more evidence before bringing out the pitchforks.

The result is obvious, but the question is demonstrably not. Good researchers know how to ask interesting questions that no one had bothered to ask before. Seeing clever work like this makes me reflect and continually ask myself “What cool angles am I missing?”

Integral Neural Networks (CVPR 2023 Award Candidate), a nifty way of building resizable networks.

My understanding of this work: A forward pass for a (fully-connected) layer of a neural network is just a dot product of the layer input with the layer weights, followed by some activation function. Both the input and the weights are vectors of the same, fixed size.

Let’s imagine that the discrete values that form these vectors happen to be samples of two different continuous univariate functions. Then we can view the dot product as an approximation to the value of integrating the multiplication of the two continuous functions.

Now instead of storing the weights of our network, we store some values from which we can reconstruct a continuous function, and then sample it where we want (in this case some trainable interpolation nodes, which are convolved with a cubic kernel). This gives us the option to sample different-sized networks, but they are all performing (an approximation to) the same operation. After training with samples at different resolutions, you can freely pick your network size at inference time.
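A toy sketch of the idea (my own code, not the paper’s method: the paper uses trainable interpolation nodes convolved with a cubic kernel, while this uses plain linear interpolation via `np.interp`). A few stored nodes define a continuous weight function, and the scaled dot product at any sampling resolution approximates the same integral:

```python
import numpy as np

def sample_weights(nodes, n):
    """Reconstruct a continuous weight function from a few interpolation
    nodes (linear interpolation here) and sample it at n points on [0, 1]."""
    node_xs = np.linspace(0.0, 1.0, len(nodes))
    xs = np.linspace(0.0, 1.0, n)
    return np.interp(xs, node_xs, nodes)

def layer_output(nodes, input_fn, n):
    """Dot product of sampled weights with sampled inputs, scaled by 1/n
    so it approximates the integral of w(t) * x(t) over [0, 1]."""
    xs = np.linspace(0.0, 1.0, n)
    w = sample_weights(nodes, n)
    x = input_fn(xs)
    return np.dot(w, x) / n

nodes = np.array([0.2, 1.0, -0.5, 0.3])  # stored instead of a full weight vector
input_fn = np.sin                         # pretend inputs are samples of sin(t)

# Two different "layer sizes" approximate the same underlying integral:
out_small = layer_output(nodes, input_fn, 64)
out_large = layer_output(nodes, input_fn, 1024)
print(out_small, out_large)  # nearly equal despite 16x different layer size
```

The point of the toy: once weights are samples of a function rather than fixed vectors, the layer width becomes a free quadrature parameter rather than a baked-in architectural choice.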

You can also take pretrained networks, reorder the weights to make the functions as smooth as possible, and then compress the network by downsampling. In their experiments, the networks lose much less accuracy when downsampled, compared to common pruning approaches.

Paper: https://openaccess.thecvf.com/content/CVPR2023/papers/Solods…

Code: https://github.com/TheStageAI/TorchIntegral

I thought the AlphaZero paper was pretty cool: https://arxiv.org/abs/1712.01815

Not only did we get an entirely new kind of chess engine, it was also fascinating to see how the engine treated different openings at various stages of its training. For example, the Caro-Kann, which is my weapon of choice, was favored quite heavily by it for a few hours and then seemingly rejected (maybe it even refuted it?!) near the end.

> … 2010. Back then nobody believed that backprop could be GPU-accelerated.

When I was doing my master’s in 2004-06, I talked to a guy whose MSc thesis was about running NNs with GPUs. My thought was: you’re going to spend a TON of time fiddling with hacky systems code like CUDA, to get basically a minor 2x or 4x improvement in training time, for a type of ML algorithm that wasn’t even that useful: in that era the SVM was generally considered to be superior to NNs.

So it wasn’t that people thought it couldn’t be done, it’s that nobody saw why this would be worthwhile. Nobody was going around saying, “IF ONLY we could spend 20x more compute training our NNs, then they would be amazingly powerful”.

> First time for ML that is not deep learning


What do you mean by this? Virtually all “classic” or “shallow” ML can be GPU-accelerated, from linear regression to SVM to GBM.

Can you point me to papers with reproducible benchmarking that achieves big speedups on those?

Modern GPUs are GP-GPUs, where GP means “general purpose”: you can run any code on them. But if you want to gain real speed-ups you have to program in an awkward style (“data parallel”). I am not aware of GPU acceleration of the work-horses of symbolic AI, such as Prolog or SMT solving. There has been a lot of work on running SAT solvers on GPUs, but I don’t think this has really succeeded so far.

I think we’re conflating two things: shallow/classic ML is not symbolic AI. I’m not sure “ML” even encompasses anything “symbolic”; I see symbolic AI and ML as subfields with little overlap.

I’m not saying symbolic AI has been GPU accelerated in the past, but that non-deep ML has been.

Good question. All supervised learning is a form of search with three components:

Specification: what are you looking for?

Search space: where are you looking?

Search mechanism: how are you going through the search space?

Program synthesis is simply learning where the search space is syntax. In deep learning, taking the ImageNet paper as an example, the specification is a bunch of photos with annotations, the search space is multivariate real functions (encoded as matrices of floats), and the search mechanism is gradient descent (implemented as backprop) via a loss function.
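The three-component view can be instantiated on a toy problem (my own illustration, not the ImageNet setup): the specification is labeled examples plus a loss to minimize, the search space is the real-valued parameters (a, b) of y = a*x + b, and the search mechanism is gradient descent on the loss.

```python
import numpy as np

# Specification: labeled examples; we want the line y = 2x + 1.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0

# Search space: a point (a, b) in R^2.
a, b = 0.0, 0.0

# Search mechanism: gradient descent on the mean squared error.
lr = 0.05
for _ in range(2000):
    err = (a * xs + b) - ys          # residuals under the loss
    a -= lr * 2 * np.mean(err * xs)  # d(MSE)/da
    b -= lr * 2 * np.mean(err)       # d(MSE)/db

print(round(a, 2), round(b, 2))  # 2.0 1.0
```

Swap the search space for syntax trees and the mechanism for enumeration or constraint solving, and the same skeleton describes program synthesis.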

I think this paper uses regular expressions as an example of how to search quickly over syntax. It claims not to be tied to regular expressions.
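A naive sketch of “the search space is syntax” (my own toy, not the paper’s algorithm, which searches far more cleverly): enumerate small regular expressions and return the first one consistent with positive and negative examples.

```python
import itertools
import re

# Tiny syntactic search space: concatenations of these fragments.
ALPHABET = ["a", "b", "a*", "b*", "(ab)", "(ab)*"]

def synthesize(positives, negatives, max_parts=3):
    """Brute-force search over regex syntax: try every concatenation of
    up to max_parts fragments, return the first pattern that matches all
    positives and no negatives."""
    for n in range(1, max_parts + 1):
        for parts in itertools.product(ALPHABET, repeat=n):
            pattern = "".join(parts)
            if all(re.fullmatch(pattern, p) for p in positives) and \
               not any(re.fullmatch(pattern, q) for q in negatives):
                return pattern
    return None

print(synthesize(["ab", "abab"], ["a", "ba"]))  # finds (ab)*
```

Even this brute-force version makes the framing concrete: the specification is the example set, the search space is regex syntax, and the mechanism is plain enumeration.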

I recently read “Enabling tabular deep learning when d ≫ n with an auxiliary knowledge graph” (https://arxiv.org/pdf/2306.04766.pdf) for one of my graduate courses. Basically, when there are considerably more data points than features (n >> d), machine learning usually works fine (assuming data quality, an underlying relationship, and so on). However, for sparse datasets where there are fewer data points than features (d >> n), most machine learning methods fail. There’s simply not enough data to learn all the relationships. This paper builds a knowledge graph based on relationships and other pre-existing knowledge about the data features to improve model performance in this case. It’s really fascinating; I hadn’t realized there were ways to get better performance in this regime.

“A classification of endangered high-THC cannabis (Cannabis sativa subsp. indica) domesticates and their wild relatives”

By McPartland and Small.

Moving on from Cannabis sativa indica and Cannabis sativa sativa to Cannabis sativa indica himalayensis and Cannabis sativa indica asperrima, depending on distribution from the original location of the extinct ancient cannabis wildtype.

Following this new classification, I believe there’s a third undocumented variety in North East Asia.

If anyone else has noticed the samesameification of cannabis strains and is wondering what the path forward is, this may be illuminating.

https://phytokeys.pensoft.net/article/46700/
