Show HN: My demo for vector embeddings for the Earth's surface




This is amazing!


Dude, please provide context on the site. I have no clue what I'm looking at or its purpose. Not trying to poo poo on it, just want context.


It’s highlighting similar areas to the area currently under the cursor.


Similar how? Geographically? Climate? Population?


Depends on the data fed to the model. Probably all of the above and many more.


Sorry! The presentation could be better. I'll work on the FAQ.


Chris - just saw your presentation of this at PNNL, awesome seeing it pop up on HN too!


Cool! Glad you got to see it working and that presentation was a nice reason to make sure everything was cleaned up.


Is the presentation of this work available online? Would love to watch or read through!


Moved the center to SF and I've been sitting, watching the spinner.

Some documentation would be helpful.


I've seen the same thing: querying SF hangs for some reason, and so does Cascais in Portugal. It works in San Mateo and Lisbon, though.


I think I can see what's going on here - I used a shapefile with the boundaries of the world's countries (and their coastlines) which had some geometric simplification applied to it. This file was used to mask out any water (for which the embedding model won't do much), but I think that the simplification process snapped the coastline too far inland, leaving some points on land which were masked out erroneously.
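
A toy illustration of that failure mode, using only the standard library. The polygons and site coordinates are made up for the example (this is not the actual shapefile or masking code): a peninsula present in the detailed coastline disappears after simplification, so a site on it is wrongly treated as water.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical "true" coastline with a small peninsula sticking out to the east.
detailed_land = [(0, 0), (4, 0), (4, 2), (6, 2), (6, 3), (4, 3), (4, 5), (0, 5)]
# Hypothetical simplified coastline: the peninsula vertices have been dropped.
simplified_land = [(0, 0), (4, 0), (4, 5), (0, 5)]

site_on_peninsula = (5.0, 2.5)
site_inland = (2.0, 2.5)

for name, site in [("peninsula", site_on_peninsula), ("inland", site_inland)]:
    kept_detailed = point_in_polygon(site, detailed_land)
    kept_simplified = point_in_polygon(site, simplified_land)
    print(name, "detailed:", kept_detailed, "simplified:", kept_simplified)
```

The inland site survives both masks, while the peninsula site is kept only by the detailed one, which would be consistent with coastal queries like SF or Cascais having no data behind them.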


Very nice!


Seems to not handle the ocean well.


It's due to the fact that they used satellite imagery to create the embeddings. The map is just for visualization. They probably used 5 or more bands of the satellite data which means each pixel is going to be slightly different due to things like depth, amount of silt in the water, amount of plankton....

Having worked on these types of problems before, I'd say the model is doing a pretty great job matching pixels.


Thanks! And you are giving it too much credit here - it's just trained on one-hot encoded land cover (24 classes) from Copernicus. Using imagery directly would be #2 on my list of to-dos after including elevation in the input data.
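
For readers who haven't met the term, here is a minimal sketch of what "one-hot encoded land cover" means. The patch values and the input layout are assumptions for illustration; only the 24-class count comes from the comment above.

```python
NUM_CLASSES = 24  # number of land cover classes mentioned above


def one_hot(class_index, num_classes=NUM_CLASSES):
    """Return a vector of zeros with a single 1.0 at class_index."""
    if not 0 <= class_index < num_classes:
        raise ValueError(f"class index {class_index} out of range")
    vec = [0.0] * num_classes
    vec[class_index] = 1.0
    return vec


def one_hot_patch(patch):
    """Encode a 2D grid of class indices as [row][col][class] one-hot vectors."""
    return [[one_hot(c) for c in row] for row in patch]


# Hypothetical 2x2 patch of land cover class indices.
patch = [[0, 3], [23, 3]]
encoded = one_hot_patch(patch)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # rows, cols, classes
```

Each pixel becomes a 24-channel vector, so a CNN sees the classes as unordered categories rather than as numbers with a meaningful order.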


Oh, so did you run the CNN on pre-classified data?


I intentionally avoided using lots of ocean areas - this way I cut down the number of required sites for inference from ~100 million (at resolution 7 in the H3 system) to around 25 million.
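
The ~100 million figure can be checked against the H3 cell-count formula, 2 + 120 * 7^resolution. A quick sketch of the arithmetic (the 25 million is the rounded number from the comment above, not an exact count):

```python
def h3_cell_count(resolution):
    """Total number of H3 cells covering the globe at a given resolution."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions run from 0 to 15")
    return 2 + 120 * 7 ** resolution


total_cells = h3_cell_count(7)
kept_cells = 25_000_000  # approximate number of sites kept for inference
kept_fraction = kept_cells / total_cells

print(f"{total_cells:,} cells at resolution 7")
print(f"{kept_fraction:.1%} of cells kept")
```

So skipping most ocean cells leaves roughly a quarter of the grid to run inference on.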


I've had to build out some version of a geospatial vector embedding / latent variable dataset for at least 4 separate projects now. Come see the viewer I've built on top of it!

The embeddings come from globally available Copernicus land cover data.


How did you generate the embeddings? The vectors are relatively small compared to all the embeddings I have seen built from image and NLP models.

Which Copernicus bands were you using? Did you augment the data with DEM info?


The embeddings were obtained using a CNN triplet loss model (~10M parameters) on the Copernicus land cover data. I haven't used DEM data yet, but I have done generative modeling on DEMs in other work and would like to do that here too.
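
For anyone unfamiliar with triplet loss, here is a minimal sketch of the loss itself on plain vectors. The Euclidean distance, the margin of 1.0, and the example vectors are assumptions; the thread doesn't say how triplets were sampled or which distance was used, and the CNN that produces the vectors is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings, just to show both regimes of the loss.
anchor = [0.0, 0.0]
similar_place = [0.0, 1.0]    # distance 1 from the anchor
different_place = [3.0, 4.0]  # distance 5 from the anchor

good = triplet_loss(anchor, similar_place, different_place)  # already well separated
bad = triplet_loss(anchor, different_place, similar_place)   # roles swapped, penalized
print(good, bad)
```

Training pushes the network to make the loss zero for as many triplets as possible, which is what pulls similar places together in the embedding space.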


Looks like Copernicus updates yearly? I can't tell if they include elevation from the "technical" tab on their home page.

Having originally come from the world of geointelligence, let me tell you this is not an easy problem to solve. For rural land use, this is probably fairly reliable, but depending on the granularity of change detection you want, cities are often building new neighborhoods in the span of months, large construction projects finish, human movement happens more in the span of hours or even minutes, and that's just for land. If you want maritime tracking, you need nearly continuous updates. We managed to do it for the Navy, but the infrastructure required for this is immense, much of the sensor technology is classified and not even available for commercial use, and the resource requirements are not remotely practical for a personal side project.

Of course, military intelligence is primarily trying to track the land use of other militaries, especially in active theaters of operations, and that changes even more frequently than regular places where people aren't constantly erecting and moving temporary headquarters, living under camouflage cover, and blowing up existing infrastructure.

I guess you're doing this for peacetime domestic real estate, like neighborhood X in city Y is similarity ranked against neighborhood U in city V? Are you incorporating pricing and demographic data or just land use? It seems to me like neighbors make the neighborhood, as much or more than qualities of the land. Along with things like usability of the sidewalks, responsiveness and level of disrepair of the roads, crime rates, level of visible homelessness, air quality, vehicular traffic congestion.

I don't want to shit on the approach too much. Usefulness is determined by the results you get. But the data here is heterogeneous: some of it ordinal, some nominal, some discrete, some continuous, with scales and units that can't be reconciled, and not necessarily drawn from similar distributions if you tried to just z-score it all. I can think of ways to use pure numerical voodoo to put it all into the same vector space, but the statistical validity of doing so is dubious at best.


Can you explain what I’m looking at? I don’t know how to interpret the hex tiles :-)


Great question. A legend or brief description of the underlying logic / heuristic would be helpful.


The heuristic is likely the result of an ML algorithm, so the underlying logic may not make much sense to us.


I'm pretty sure I'm not the intended audience but I also have no idea what this is used for. Surveying? Real estate tycoons? Oil & gas exploration?


It's a way to encode land to make predictions about it. E.g. is the land arable, is it rural, how similar is it to X, etc. Embeddings help encode data in formats more usable by ML models.


The question was: in what context do people need to answer a question like "which geographical points are close to X and similar to X"?

I don't understand who the target audience is and what this can be used for.


My guess is this site is simply a way to explore the embeddings. People make similar data visualization tools for word embeddings, so that's what I assumed this was.


The original idea came from something I saw at work - we needed a way to build generic feature sets representing something about real estate, but beyond the data we had on prices, floors, and other house-specific details.


Sure, I get that part -- but then how do people use the predictions?


The embeddings are generally used by algorithms, not people. You could ask something like "what's the most similar place to X within Y", and it would use the embeddings (which cover a variety of facts) to calculate the answer. An embedding is an N-dimensional vector (where the dimensions may or may not be meaningful to us), and similarity between places can be implemented as similarity between their vectors.


Yup, and while the similarity search is perhaps the most visually appealing way to work with it, the real use (in my opinion) is in providing generic sets of geospatial features which are reusable across applications. I've built out versions of H3-referenced feature sets at each of the jobs I've had over the last 10 years.
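
A rough sketch of what "reusable feature sets" could look like in practice: the embedding is joined to application records by cell ID and appended to whatever features the application already has. The cell IDs, field names, and numbers below are all placeholders. Real keys would be H3 indexes computed from latitude/longitude with an H3 library.

```python
# Hypothetical embedding table keyed by cell ID (placeholder keys, 3 dimensions).
EMBEDDING_DIM = 3
embeddings = {
    "cell_A": [0.1, 0.7, 0.2],
    "cell_B": [0.9, 0.0, 0.1],
}

# Hypothetical application records, here real estate listings.
listings = [
    {"id": 1, "floors": 2, "area_m2": 120.0, "cell": "cell_A"},
    {"id": 2, "floors": 1, "area_m2": 75.0, "cell": "cell_B"},
    {"id": 3, "floors": 3, "area_m2": 200.0, "cell": "cell_missing"},
]


def build_features(listing, embeddings, dim=EMBEDDING_DIM):
    """Concatenate house-specific features with the location embedding."""
    # Fall back to a zero vector when a cell has no embedding (e.g. masked out).
    geo = embeddings.get(listing["cell"], [0.0] * dim)
    return [float(listing["floors"]), listing["area_m2"]] + geo


feature_rows = [build_features(listing, embeddings) for listing in listings]
for row in feature_rows:
    print(row)
```

The same embedding table could be joined to crop records or conservation sites in exactly the same way, which is the point of keying everything on the grid cell.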


Sure! The basic idea is that each hexagon is a discrete unit of space for which I obtain a vector embedding. This vector is supposed to represent a sort of data-based summary of that location, obtained in this case using deep learning.

When you put the search on a hex, it looks up the vector for that hex and then performs a similarity search on all other vectors within the circle and shows the ones which are most similar in terms of land cover. The dependence on land cover / land use data is just because that was easy to get.
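
The lookup described above can be sketched in a few lines. Everything here is illustrative: cosine similarity is an assumed metric (the thread doesn't say which one the site uses), the cell IDs, coordinates, and 3-dimensional vectors are made up, and a real version would use a vector index rather than a linear scan.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_id, cells, radius_km, top_k=3):
    """Rank cells inside the search circle by similarity to the query cell."""
    q = cells[query_id]
    scored = []
    for cell_id, cell in cells.items():
        if cell_id == query_id:
            continue
        if haversine_km(q["lat"], q["lon"], cell["lat"], cell["lon"]) > radius_km:
            continue  # outside the search circle
        scored.append((cosine_similarity(q["vec"], cell["vec"]), cell_id))
    scored.sort(reverse=True)
    return [cell_id for _, cell_id in scored[:top_k]]


# Hypothetical cells: three near San Francisco, one far away with an identical vector.
cells = {
    "a": {"lat": 37.77, "lon": -122.42, "vec": [1.0, 0.0, 0.2]},
    "b": {"lat": 37.80, "lon": -122.40, "vec": [0.9, 0.1, 0.2]},
    "c": {"lat": 37.75, "lon": -122.45, "vec": [0.0, 1.0, 0.0]},
    "d": {"lat": 40.71, "lon": -74.00, "vec": [1.0, 0.0, 0.2]},
}
result = most_similar("a", cells, radius_km=25)
print(result)
```

Cell "d" has the same vector as the query but never shows up, because the radius filter is applied before ranking. Dropping that filter is all a global search would take conceptually, at the cost of scanning far more vectors.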

As other folks have pointed out here, raw satellite imagery is also a potential input source for this. I'm playing around with other sources and really want to integrate something like GeoVex into the embeddings as well.


Would it provide useful/interesting results if the similarity search was global? E.g. find me neighborhoods in London most similar to this one in Chicago?




This tool looks very interesting, and seems to work well, but being utterly unfamiliar with geospatial vector embeddings, their purpose or use, I had no idea what I was looking at, or why.

It seems to show areas of similarity, within a radius of a central query location, with regard to (perhaps) vegetation cover (e.g., forests, grasslands, wetlands), artificial surfaces (e.g., urban areas, roads), agricultural areas, water bodies, etc., overlaid on Google Maps, and allows exporting the embeddings for lat/lons as CSV. It looks like land features for hexagonal grid areas have been turned into points in a 15-dimensional space, and some sort of nearest-neighbor search is done to return the most similar other grid areas within the larger area. It does indeed seem accurate in my area!

I'm not sure what this would be useful for, but I'm assuming urban planning, real estate, agriculture or conservation? I know I'm not the target audience, but more info or ideas would be fascinating.


You pretty much hit the nail on the head. The application areas you mentioned are the same as the ones that I had in mind when developing this.


this is more like math "eyecandy" in the present state

source: professional urban planning in California
