Ideally, each example/snippet should include a screenshot and a copy of the notebook. (A GitHub Gist works well for the latter since GitHub will format the notebook nicely.)
Great topic idea @Symbioquine! This is exciting stuff!
Cross-reference: CSV Importers in v2.x
Nice. Gonna have to dig into this. (This too)
Cheers @mstenta I’ve installed the module, just have to go and figure out what it can do for me.
This YouTube video might help (if you didn’t see it already in the CSV thread, or for newcomers who find this comment)…
The data is the cb_2018_us_state_20m.kml file from https://www2.census.gov/geo/tiger/GENZ2018/kml/cb_2018_us_state_20m.zip
Keen as i am to see this JupyterLite API interface coming to an instance near me (i.e. myinstance.farmos.net/jupyterlite -right @mstenta?), i have in the meantime been exploring this alternative by Google called “Colaboratory” -first awkward experiments in this here .ipynb notebook.
AFAICT, this is essentially a Jupyter Notebook running in the browser (i.e. zero installation on desktop), but powered by Google’s analytics engine in the cloud, so it’s like the JupyterLite example @donblair shared in that respect. All those DataSci Python libraries popular in DS/ML circles are there for the import -e.g. requests, @paul121 -and access to the farmOS.py library is as easy as !pip install farmOS==1.0.0b3, as you can see in the example linked above. It has some value-add features in the interface, like a TOC outliner and a very cool “Interactive Data Table” extension, which wraps filter/sort/pagination controls around the pandas dataframe. And of course (this being Google), they make it easy to share notebooks privately with anyone who has a gmail, save them to gDrive, Dropbox, GitHub, gists, etc.
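For anyone following along, here is a minimal sketch of how a notebook cell might talk to the farmOS API with plain requests once that pip install has run. The hostname and token are hypothetical (a real token comes out of the OAuth flow, e.g. via farmOS.py); on farmOS 2.x the JSON:API root lives at /api:

```python
# Sketch: querying a farmOS 2.x instance's JSON:API with plain `requests`.
# HOSTNAME and TOKEN are hypothetical placeholders, not real credentials.
import requests

HOSTNAME = "https://myinstance.farmos.net"  # hypothetical instance
TOKEN = "hypothetical-oauth-token"          # would come from the OAuth flow

url = f"{HOSTNAME}/api"  # the JSON:API root lists the available resource types
headers = {"Authorization": f"Bearer {TOKEN}"}

# Not executed here, since it needs a live instance and a real token:
# response = requests.get(url, headers=headers)
# print(response.json()["links"].keys())
```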
All that being said: in the interest of using FOSS whenever possible, i would much rather have my farmOS API interface in JupyterLite, even without those extra Google features. Just so long as i can run farmOS.py in it -including that requests library, which i gather is essential- then i will happily adopt JupyterLite as my API interface, just as soon as it lands in Farmier.
The only thing that’s necessary to get that to work with Farmier-hosted instances is that we “bless” (via CORS config) the edgecollective.io domain in your instance, so that your browser “trusts” the third-party URL. I did that already for yours, so you should be good to go!
(If anyone else is using Farmier and wants to play around with https://edgecollective.io/jupyterlite/ ping me!)
Ultimately, it will be great to include the new Drupal JupyterLite module that @Symbioquine and I started on Farmier - but that’s not strictly necessary. The only benefit is you don’t need to configure CORS because it’s serving JupyterLite from the same domain, so your browser will automatically “trust” it.
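For the curious, Drupal (which farmOS 2.x is built on) exposes these CORS settings in a site’s services.yml. Roughly, adapted from the shape of Drupal’s default.services.yml (the origin shown is just the example from this thread, and the exact values are an assumption, not Farmier’s actual config):

```yaml
# Sketch of a Drupal services.yml CORS section; values are illustrative.
parameters:
  cors.config:
    enabled: true
    allowedHeaders: ['*']
    allowedMethods: ['GET', 'POST', 'PATCH', 'DELETE', 'OPTIONS']
    # The third-party origin being "blessed" to call this instance's API:
    allowedOrigins: ['https://edgecollective.io']
    exposedHeaders: false
    maxAge: false
    supportsCredentials: true
```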
@walt Just be aware of the way “file storage” works with JupyterLite, so you don’t lose anything! All files you upload or create in JupyterLite are stored IN YOUR BROWSER SESSION … they are NOT stored on the farmOS server. If you clear your browser cache or change browsers, they are gone!
We have some fun ideas to store/serve files from farmOS itself being discussed here:
Thanks, @mstenta -since you configured my instance to accept API calls from that URL- i did have a serious play with this. Works just fine for API access -nice! But when i upload a .CSV to the same directory in which the .ipynb resides, and try to do anything with it, it fails to find the file. Have tried every way i could think of to get around this, and consulted @donblair about the problem… No joy.
Good cautionary note, thanks -but not a problem at this point, as my browser cannot lose what it cannot even find in the first place! In fact the browser finds & uploads the file just fine; it’s jLite that can’t seem to find it.
Heh: i’ll bet that “some [confused] users” Don mentioned in first line of his problem statement was inspired by yours truly <8-)…
For now, you need this bit of magic that I’ve included in some of my examples above:
# From https://gist.github.com/bollwyvl/132aaff5cdb2c35ee1f75aed83e87eeb
async def get_contents(path):
    """use the IndexedDB API to access JupyterLite's in-browser (for now) storage

    for documentation purposes, the full names of the JS API objects are used.

    see https://developer.mozilla.org/en-US/docs/Web/API/IDBRequest
    """
    import js, asyncio

    DB_NAME = "JupyterLite Storage"

    # we only ever expect one result, either an error _or_ success
    queue = asyncio.Queue(1)

    IDBOpenDBRequest = js.self.indexedDB.open(DB_NAME)
    IDBOpenDBRequest.onsuccess = IDBOpenDBRequest.onerror = queue.put_nowait

    await queue.get()

    if IDBOpenDBRequest.result is None:
        return None

    IDBTransaction = IDBOpenDBRequest.result.transaction("files", "readonly")
    IDBObjectStore = IDBTransaction.objectStore("files")
    IDBRequest = IDBObjectStore.get(path, "key")
    IDBRequest.onsuccess = IDBRequest.onerror = queue.put_nowait

    await queue.get()

    return IDBRequest.result.to_py() if IDBRequest.result else None
If you put that in the top cell of your notebook (and run it first), then you can access the contents of an “uploaded” (to the browser storage) file named “my_file.csv” (as a str object) with:

csv_str = (await get_contents("my_file.csv"))["content"]
Or if you need a file object, you can wrap that in io.StringIO, e.g.
file_obj = io.StringIO((await get_contents("my_file.csv"))["content"])
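The same wrapping trick can be demonstrated outside the browser; a small sketch, with a hypothetical CSV string standing in for what get_contents(...)["content"] would return (assumes pandas is available):

```python
import io
import pandas as pd

# Hypothetical stand-in for the string that get_contents(...)["content"]
# would return from JupyterLite's in-browser storage
csv_str = "name,area\nCarrots,12\nKale,7\n"

# pandas.read_csv accepts any file-like object, so wrap the string in StringIO
file_obj = io.StringIO(csv_str)
df = pd.read_csv(file_obj)
print(df.shape)  # (2, 2)
```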
Yeah, I think that’s one of the rough edges that will get resolved as JupyterLite moves out of alpha. Maybe as part of jupyterlite/jupyterlite#315.
Thanks @Symbioquine, but… I did put the long magic spell atop the notebook, have tried both of the calls you suggested below it, and neither seems to work. You can see from the below screenshot that my CROPS.csv for upload is in the same directory as the .ipynb… And you can see the error msg when i try the first method. Can you make any sense of this?
I think I see two problems:
- If the cell at the top of the screenshot is the “long magic spell” you copied, I don’t think you got the whole thing. The last line should start with “return IDBRequest”
- I think you need to remove the quotes from df=pd.read_csv("csv_str"). I think it should be df=pd.read_csv(csv_str)
Thanks @Symbioquine; fixed the two errors you cite, but i still get the same error -TypeError: 'NoneType' object is not subscriptable- on line 2, where the “csv_str” variable is defined (not on line 4, where it is invoked, now w/o the quotes).
Got any more insight into what that error msg might actually refer to? (could be many things, according to google, that would trigger this “most common exception in python”)
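For what it’s worth, that particular error is exactly what happens when get_contents() returns None (i.e. no file is stored under the path you gave it) and the next line tries to index into the result. A minimal reproduction, runnable outside the browser:

```python
# get_contents() returns None when no file is stored under the given path;
# subscripting None then raises the same TypeError seen in the notebook.
result = None  # stand-in for get_contents("CROPS.csv") finding nothing

try:
    csv_str = result["content"]
except TypeError as err:
    message = str(err)
    print(message)  # 'NoneType' object is not subscriptable
```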
PS: problem solved, using this method, placed after that longer script in the previous block:

import io
file_obj = io.StringIO((await get_contents("myfilename.csv"))["content"])
df = pd.read_csv(file_obj)
note to fellow n00bs: you can’t just refer to yourfilename.csv in the script, even tho it’s in the same directory as the .ipynb; you should grab the path via right-click on the file in the JupyterLite file browser, because it needs an absolute reference (it was nested in the farmos/ directory, in this case).
Thanks a heap, @Symbioquine , for all the help it took to find my mistake!
In the course of my stumbling around yesterday, related to this CSV upload challenge, i came to a deeper realisation of what you wrote yesterday @mstenta, i.e.:
Just be aware of the way “file storage” works with JupyterLite, so you don’t lose anything! All files you upload or create in JupyterLite are stored IN YOUR BROWSER SESSION … they are NOT stored on the farmOS server. If you clear your browser cache or change browsers, they are gone!
Yes; in fact, to debug my problem, i had to switch from Chrome to Firefox, and of course the subject file had to be re-uploaded… Which got me wondering about a coupla things, in the context of this UseCase:
- A primary benefit of this application architecture is having the freedom to work from any machine- e.g. farm office and/or home -but, given the diffs that will inevitably arise in state across those two machines, how can we mitigate the confusion that will consequently arise?
- Memory management: This little (26kb) CSV is small enough to be of no concern… But this being a workflow we plan to run at least weekly- and sometimes on much larger files -what might be the negative impact on browser/system performance, and what should be done to mitigate that problem?
- Given these (and other?) limitations, a JupyterLite NB that references files is not suitable for sharing a replicable result -whether in the interest of tech support (as @Symbioquine and i experienced yesterday) or in the larger context of Replicable Data Science.
Obviously i don’t understand this technology enough… So i did a little digging yesterday, from which i gathered that localStorage is a subtype of Web Storage (f.k.a. DOM Storage), along with two other forms that are more familiar to me. What complicates matters further is the different ways in which browser-makers implement the standard (that’s the point at which my head started to hurt, so i quit digging), but i did find this little table (pictured below) that helped me to understand essential similarities & diffs.
Bottom-line: There’s enough deep voodoo about this stuff that- to avoid sliding into even deeper doodoo! -i think it will be wise to store any files referenced in the JupyterLite NB in an online archive, and link them explicitly in the document.
A more up-to-date table from that article.
Otherwise, I think you’re making some great points - most of which don’t have concrete answers.
I will say though that many of those issues are mitigated just by changing how we think of the “storage” in JupyterLite. If we consider the storage in JupyterLite like a sandbox environment or temporary work area, then we can treat those things as advantages.
I would argue that it is better for both the tech support and replicable data science scenarios.
- I was able to start with a clean slate and bring in just the files I needed to try and reproduce the problem you were having.
- I was able to modify the files without any risk of losing data or breaking things for anyone else.
- It forces me to be intentional about sharing just the versions/changes which are important.
- Conversely, it helps ensure the collective workspace isn’t increasingly littered with semi-relevant experiments.
I would also argue that one of the foundations of truly replicable data science is going to be consistent and disciplined use of version control technologies like Git to manage the source code - and in some cases sample data. Perhaps in the future JupyterLite could help with that part, but in a way it’s kind of beautiful that it doesn’t. Its job is just to be a place that’s reproducible (but not replicated) between users to run some scripts in a little more interactive environment than a text editor and a command line.
Thanks @Symbioquine for providing a more nuanced perspective. I can see how what i was inclined to view as bugs might be considered features… Just so long as we (a) treat it as a “sandbox,” and (b) employ that “more consistent and disciplined use of version control tech” for both sources and sample data.
Also: since you had me open Developer Tools yesterday (in search of that CROPS.csv file), and as that article you linked explains in more detail (illustrated by screenshot below), i can now easily navigate to where these files are stored (IndexedDB indeed, in both Chrome and Firefox), confirm the keys and drill down into values… But only by navigating the JSON tree, in which form my nice tidy tables have been rendered.
One day i hope to get over my allergy to these so-deeply-nested JSON trees; still, being more of a rows&columns kinda guy, i have to ask: is there any easy way to translate this JSON tree back into tabular .CSV form?
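One possible approach (a sketch only, using pandas and a hypothetical nested record, since the actual shape of the stored JSON will differ): pandas.json_normalize can flatten a nested JSON tree into a flat table with dotted column names, which to_csv can then write out as rows & columns:

```python
import pandas as pd

# Hypothetical nested record, loosely shaped like an IndexedDB "files" entry;
# the real stored JSON will have different keys.
record = {
    "name": "CROPS.csv",
    "metadata": {"size": 26000, "type": "text/csv"},
}

# json_normalize flattens nested keys into dotted column names
df = pd.json_normalize(record)
print(df.columns.tolist())  # ['name', 'metadata.size', 'metadata.type']

# ...and to_csv turns the flat table back into CSV text
csv_text = df.to_csv(index=False)
```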